Abstract
Statistics in sports is a growing research field that provides specialized methodology for collecting and analyzing sports data in order to support decisions about planning and implementing new strategies [1]. Sports in particular have countless, ever-expanding data sources from which analysts can extract objective information, for example to make predictions throughout a season or to improve team and player performance. Broadly, sports analysis is described as the process of data management, predictive model implementation, and the use of information systems for decision making to gain a competitive advantage.

The data was collected from two sources: the Synergy database [2] and the OUA website [3]. It is game-by-game data for every team in the U Sports division of the Ontario University Athletics (OUA) conference during the regular seasons from 2015/2016 to 2018/2019. The datasets extracted from Synergy contain insufficient data for the 2014/2015 season, so only the seasons from 2015/2016 to 2018/2019 are used. Six datasets were extracted per game and aggregated: Play Types, Sets, Shots, Transitions, General Statistics, and Player Statistics. The data scraping functions that were used can be found in the appendix.
This dataset shows the number of different types of plays per game.
| Features | Description |
|---|---|
| ID | a unique ID to reference the games (row data) |
| Team | Name of the team |
| Season | the year of the regular season that the game took place |
| All Isolation | Number of possessions that have an isolation play |
| All Offensive Rebounds | Number of possessions that involve an offensive rebound |
| All P.R Ball Handler (BH) | Number of possessions involving a pick & roll ball handler |
| All Possessions | Number of possessions in a game |
| All Post-Up | Number of possessions involving a post-up play |
| Cuts | Number of possessions involving cuts |
| Handoffs | Number of possessions involving a handoff play |
| Isolation Defense Commits | A category of an isolation play where the defense commits |
| Isolation Single Covered | A category of an isolation play where no defense commits |
| Miscellaneous Possessions | Number of possessions that do not fit a certain category |
| Off Screens | Number of possessions involving an Off Screen |
| Offensive Rebound PutBack | A category of Offensive Rebounds where ball gets tipped in |
| Off Reb Reset Offense | A category of Offensive Rebounds where the offense resets |
| PR BH Defense Commits | A category of pick & roll BH where the defense commits |
| PR BH Single Covered | A category of pick & roll BH where no defense commits |
| PR BH Traps | A category of pick & roll BH where the BH gets trapped |
| PR Roll Man | A category of pick & roll BH where the roll man scores |
| Post-Up Defense Commits | A category of Post-Ups where the defense commits |
| Post-Up Hard Double Team | A category of Post-Ups where a double team comes |
| Post-Up Single Covered | A category of Post-Ups where no defense commits |
| Spot Ups | Number of possessions with a Spot-Up play |
| Total Points | Total Number of Points for the game |
| Win | 1 indicating a win and 0 indicating a loss |
This dataset contains the number of possessions for every way the Offense sets up.
| Features | Description |
|---|---|
| ID | a unique ID to reference the games (row data) |
| Team | Name of the team |
| Season | the year of the regular season that the game took place |
| After Time Outs | Number of possessions that are after a time-out |
| Half Court Set All | Number of possessions when the offense is set |
| Half Court Set All No Pts | Number of possessions when the offense is set but no points are scored |
| Half Court Set All Pts | Number of possessions when the offense is set and points are scored |
| Half Court Set Vs Zone | Number of possessions when the offense is set vs. zone defense |
| Half Court SetVs.Zone Pts | Number of possessions when the offense is set vs. zone defense and points are scored |
| Half-Court SetVs.Zone No | Number of possessions when the offense is set vs. zone defense and no points are scored |
| Last 4 Seconds | Number of possessions when it is the last 4 seconds |
| Out Of Bounds | Number of possessions after inbounding from out of bounds |
| Out of Bounds End | Number of possessions from an out of bounds from the end |
| Out of Bounds Side | Number of possessions from an out of bounds on the side |
| Total Points | Total Number of Points for the game |
| Win | 1 indicating a win and 0 indicating a loss |
This dataset contains the types of shots that were taken per game.
| Features | Description |
|---|---|
| ID | A unique ID to reference the games (row data) |
| Team | Name of the team |
| Season | The year of the regular season that the game took place |
| 2FG Attempts | Number of 2PT Field Goal (FG) Attempts |
| 2FG Made | Number of 2PT FG Made |
| 2FG Missed | Number of 2PT FG Missed |
| 3FG Attempts | Number of 3PT FG Attempts |
| 3FG Made | Number of 3PT FG Made |
| 3FG Missed | Number of 3PT FG Missed |
| All Free Throws | Number of Free Throws |
| Live Free Throws | Number of Live Free Throws |
| FG Attempts | Number of FG Attempts |
| FG Made | Number of FG Made |
| FG Missed | Number of FG Missed |
| Guarded Jump Shots | Number of Guarded Jump Shots |
| Unguarded Jump Shots | Number of Unguarded Jump Shots |
| Long Jump Shots | Number of 3 Point Shots |
| Medium Jump Shots | Number of shots from 17 ft to < 3 point line |
| Short Jump Shots | Number of shots from < 17 ft |
| Total Points | Total Number of Points for the game |
| Win | 1 indicating a win and 0 indicating a loss |
This dataset contains information about transition plays.
| Features | Description |
|---|---|
| ID | a unique ID to reference the games (row data) |
| Team | Name of the team |
| Season | the year of the regular season that the game took place |
| All Push Ball | Number of Possessions where the ball is being pushed |
| Push Ball - Shot Attempt | A category of Push Ball where the ball is being pushed |
| Push Ball - Turnover | A category of Push Ball where the ball is being pushed |
| Push Ball to Half Court | A category of Push Ball where the ball is being pushed to half court |
| Press Offense | Number of Possessions where the offense is being pressed |
| Transition Offense | Number of Transition plays |
| Transition Turnover | Number of Transition plays leading to a turnover |
| Total Points | Total Number of Points for the game |
| Win | 1 indicating a win and 0 indicating a loss |
| Features | Description |
|---|---|
| Game ID | a unique ID to reference the games (row data) |
| Date | the date that the game took place |
| Season | the year of the regular season that the game took place |
| Team | Name of the team |
| Player | Name of the player |
| Home | A binary value; 1 indicating Home team, 0 Away team |
| GP | Total number of games played |
| MPG | Minutes played per game |
| PPG | Points per game |
| PTS | Total Points scored |
| MIN | Total minutes played |
| FGM | Number of Field Goals Made for the team |
| FGA | Number of Field Goals Attempted for the team |
| Field Goal% | (FGM/FGA) x 100 |
| 3PM | Number of 3 Pointers Made by the team |
| 3PA | Number of 3 Pointers Attempted by the team |
| 3Point% | (3PM/3PA) x 100 |
| FTM | Number of FreeThrows Made by the team |
| FTA | Number of FreeThrows Attempted by the team |
| FT% | (FTM/FTA) x 100 |
| Assists | Number of Assists the team made |
| Rebounds | Number of Rebounds the team made |
| Steals | Number of Steals in the game |
| Blocks | Number of Blocks in the game |
| Turnovers | Number of Turnovers by the team |
| Features | Description |
|---|---|
| Game ID | a unique ID to reference the games (row data) |
| Date | the date that the game took place |
| Season | the year of the regular season that the game took place |
| Team | Name of the team |
| Home | A binary value; 1 indicating Home team, 0 Away team |
| FGM | Number of Field Goals Made for the team |
| FGA | Number of Field Goals Attempted for the team |
| Field Goal% | (FGM/FGA) x 100 |
| 3PM | Number of 3 Pointers Made by the team |
| 3PA | Number of 3 Pointers Attempted by the team |
| 3Point% | (3PM/3PA) x 100 |
| FTM | Number of FreeThrows Made by the team |
| FTA | Number of FreeThrows Attempted by the team |
| FT% | (FTM/FTA) x 100 |
| Assists | Number of Assists the team made |
| Rebounds | Number of Rebounds the team made |
| Steals | Number of Steals in the game |
| Blocks | Number of Blocks in the game |
| Turnovers | Number of Turnovers by the team |
| Points off Turnovers | Number of Points made off Turnovers by the team |
| Points in the Paint | Number of Points made in the Paint by the team |
| 2nd Chance Points | Total 2nd Chance Pts for the team |
| Bench Points | Number of Pts made by the bench players for the team |
| Fastbreak Pts | Number of Fastbreak Pts by the team |
| Largest Lead | The Largest Lead made by the team |
| Time of Largest Lead | The time of the team’s Largest Lead |
| Win | A binary value; 1 indicating a win, 0 loss |
| Winner 1st Qtr Pts | The number of points scored in the 1st qtr |
| Winner 2nd Qtr Pts | The number of points scored in the 2nd qtr |
| Winner 3rd Qtr Pts | The number of points scored in the 3rd qtr |
| Winner 4th Qtr Pts | The number of points scored in the 4th qtr |
| OT Pts | The number of points scored in overtime |
Exploratory Data Analysis is valuable to data projects because it helps in understanding the data, making sure it is worth investigating, and checking for anomalies. The raw data needs to be validated to ensure that the data set was collected without errors.
Distributions are often described in terms of their density functions.
A density function describes how the proportion of observations, or their likelihood, changes over the range of the distribution. Certain analyses require certain distributions, and where an analysis needs all variables to be on a comparable scale, standardization will need to be used.
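As a minimal sketch of the standardization mentioned above, the snippet below converts one variable to z-scores (subtracting the mean and dividing by the sample standard deviation). The function name `standardize` and the Spot-Up counts are hypothetical and serve only as an illustration; they are not taken from the dataset.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in values]


# Hypothetical per-game Spot-Up counts, for illustration only.
spot_ups = [18, 22, 25, 20, 30]
z_scores = standardize(spot_ups)
```

After standardization the variable has mean 0 and standard deviation 1, which puts features measured on different scales on a common footing.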
Below are basic summary statistics of the Play Types dataset, i.e. the mean, standard deviation, median, minimum, maximum, and range of each variable. To interpret them, the reader should refer to Table 1 in section 1.1, where each feature and its description are given. Variables marked with an asterisk are categorical and are summarized through their integer codes, which is why Win ranges from 1 to 2 in this table.
On average, there are 92.05 possessions per game (the “Possessions” row below) across all 1452 regular season games in the dataset. The Spot-Up is the most frequent play type, averaging 22.35 Spot-Ups per game. A Spot-Up occurs when a player is set in a position to shoot and receives the ball to take the shot; typically, this is a player waiting at the 3-point line. An Off-Screen possession occurs when an offensive player receives the ball after a teammate sets a screen that frees them for a pass. These two types of possession can never happen simultaneously, since a Spot-Up requires that no screen is used before the player catches the ball. Examples of a player spotting up are standing in the corner before catching and shooting, relocating to the 3-point line, or fading to the corner and getting the ball on a kick-out. These possessions are not limited to catching and shooting: the player may also attack a close-out by dribbling into a pull-up, dribbling into a floater, or driving to the rim. Because this play type has the highest frequency among games, it is worthwhile to analyze, and improving Spot-Up technique could work to a team’s advantage.
| | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| TotalPoints | 77.4400826 | 13.5884622 | 78.0 | 36 | 125 | 89 |
| Win* | 1.5000000 | 0.5001723 | 1.5 | 1 | 2 | 1 |
| Season* | 2.6033058 | 1.1111564 | 3.0 | 1 | 4 | 3 |
| AllIsolation | 8.6053719 | 4.5569282 | 8.0 | 0 | 27 | 27 |
| AllOffensiveRebounds | 11.0723140 | 4.0369936 | 11.0 | 2 | 26 | 24 |
| AllP.RBallHandler | 19.0847107 | 7.3219072 | 18.0 | 2 | 43 | 41 |
| Possessions | 92.0516529 | 7.6249584 | 92.0 | 52 | 137 | 85 |
| AllPost.Up | 8.4531680 | 5.2572562 | 8.0 | 0 | 34 | 34 |
| Cuts | 7.1122590 | 3.4954634 | 7.0 | 0 | 22 | 22 |
| Handoffs | 2.5716253 | 2.1355756 | 2.0 | 0 | 14 | 14 |
| Isolation.DefenseCommits | 2.6508264 | 2.0962322 | 2.0 | 0 | 17 | 17 |
| Isolation.SingleCovered | 5.9545455 | 3.5539555 | 5.0 | 0 | 22 | 22 |
| MiscellaneousPossessions | 6.7520661 | 3.2296379 | 6.0 | 0 | 20 | 20 |
| OffScreens | 4.0378788 | 2.7003990 | 4.0 | 0 | 16 | 16 |
| Off.Reb..PutBacks | 5.9035813 | 2.9562047 | 6.0 | 0 | 19 | 19 |
| Off.Reb..ResetOffense | 5.1687328 | 2.4915654 | 5.0 | 0 | 15 | 15 |
| P.RBallHandler.DefenseCommits | 10.9931129 | 5.0872424 | 11.0 | 0 | 32 | 32 |
| P.RBallHandler.SingleCovered | 7.7217631 | 4.2227431 | 7.0 | 0 | 28 | 28 |
| P.RBallHandler.Traps | 0.3698347 | 0.8604964 | 0.0 | 0 | 7 | 7 |
| P.RRollMan | 3.1666667 | 2.3649850 | 3.0 | 0 | 13 | 13 |
| Post.Up.DefenseCommits | 1.6763085 | 1.7005016 | 1.0 | 0 | 10 | 10 |
| Post.Up.HardDoubleTeam | 1.4407713 | 1.8879549 | 1.0 | 0 | 15 | 15 |
| Post.Up.SingleCovered | 5.3360882 | 3.8232836 | 5.0 | 0 | 25 | 25 |
| SpotUps | 22.3519284 | 5.7683973 | 22.0 | 4 | 44 | 40 |
| Transitions | 18.0172176 | 6.1807470 | 17.0 | 3 | 44 | 41 |
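The columns of the table above can be reproduced for any single feature with a few lines of code. The following is only a sketch: the function name `describe` and the input values are hypothetical, and the sample (n - 1) standard deviation is assumed.

```python
import statistics


def describe(values):
    """Summarize one numeric column with the statistics used in the table."""
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (assumed)
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


# Hypothetical Spot-Up counts for five games, for illustration only.
summary = describe([18, 22, 25, 20, 30])
```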
Distribution of PlayTypes Features.
The distributions of most of the Isolation, Post-Up and Pick and Roll plays are skewed to the right, along with Handoffs, Off Screens and Miscellaneous Possessions. The rest of the plays are approximately normal.
Note: There is a difference in number of games per season because the number of games played per season increased from 19-20 games to 23-24 games in 2017/2018.
An outlier is defined as a sample or event that is very inconsistent with the rest of the dataset. In sports, however, outliers are not due to measurement errors; they arise because teams play differently against different opponents. Rather than removing them, it would be better to average the data, aggregating by team and season.
Scatterplots of certain Play Types vs. Wins (1) or Losses (0)
There is no clear pattern of any individual play type with respect to wins. This makes sense, since different teams have different styles of play and have to adjust to their opponents’ style of play. It would make more sense to look at the differentials for each game. For instance, if one team is taller than the other, the taller team may want to post up more since it would have the advantage, and this advantage may make that team more likely to win.
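One way the per-game differentials suggested above could be computed is sketched below. The rows, the field names, and the assumption that both teams of a game share a single `game_id` are all hypothetical; the actual ID scheme of the dataset may differ.

```python
from collections import defaultdict

# Hypothetical rows (one per team per game); field names are illustrative,
# and it is an assumption that both teams of a game share one game_id.
rows = [
    {"game_id": 1, "team": "Team A", "post_ups": 12, "win": 1},
    {"game_id": 1, "team": "Team B", "post_ups": 5, "win": 0},
    {"game_id": 2, "team": "Team A", "post_ups": 7, "win": 0},
    {"game_id": 2, "team": "Team C", "post_ups": 9, "win": 1},
]


def differentials(rows, feature):
    """Return one record per team per game: own value minus opponent's value."""
    by_game = defaultdict(list)
    for row in rows:
        by_game[row["game_id"]].append(row)
    result = []
    for pair in by_game.values():
        if len(pair) != 2:
            continue  # skip games where one side is missing
        for own, opponent in (pair, pair[::-1]):
            result.append({
                "game_id": own["game_id"],
                "team": own["team"],
                "diff": own[feature] - opponent[feature],
                "win": own["win"],
            })
    return result


post_up_diffs = differentials(rows, "post_ups")
```

Plotting `diff` against `win` would then show whether out-posting the opponent, rather than the raw number of post-ups, is associated with winning.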
Below are the basic summary statistics of the Sets dataset, which shows how often a team sets up its offense, and where and when it does so. Again, the reader can refer to Table 2 in section 1.2 for the features and their descriptions. The half-court vs. zone variables may look anomalous, but this is because zone defense is not a popular defensive style in the league: in the few games where a team plays zone defense throughout, the opposing team has to set its offense against it for the entire game. The zone variables have right-skewed distributions, which further shows that zone defense is not a popular defensive style in U Sports basketball.
| | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| AfterTimeOuts.ATO. | 8.637741 | 2.0509452 | 9.0 | 1 | 17 | 16 |
| HalfCourtSetAll | 74.034435 | 7.3461373 | 74.0 | 40 | 113 | 73 |
| HalfCourtSetAll.NoPts | 46.580579 | 7.0841925 | 46.0 | 24 | 73 | 49 |
| HalfCourtSetAll.Pts | 27.453857 | 5.2931895 | 27.0 | 11 | 48 | 37 |
| HalfCourtSetvs.Zone.NoPts | 2.807851 | 5.7575283 | 1.0 | 0 | 46 | 46 |
| HalfCourtSetvs.Man | 69.700413 | 10.8398110 | 71.0 | 6 | 113 | 107 |
| HalfCourtSetvs.Man.NoPts | 43.772727 | 8.7634901 | 44.0 | 4 | 71 | 67 |
| HalfCourtSetvs.Man.Pts | 25.927686 | 5.7791545 | 26.0 | 1 | 45 | 44 |
| HalfCourtSetvs.Zone | 4.334022 | 8.5731538 | 1.0 | 0 | 77 | 77 |
| HalfCourtSetvs.Zone.Pts | 1.526171 | 3.0937572 | 0.0 | 0 | 32 | 32 |
| Last4Sec.ofShotClock | 7.323003 | 3.4618639 | 7.0 | 0 | 20 | 20 |
| OutofBounds | 9.828512 | 3.1749254 | 10.0 | 1 | 23 | 22 |
| OutofBounds.End. | 5.244490 | 2.4351274 | 5.0 | 0 | 15 | 15 |
| OutofBounds.Side. | 4.584022 | 2.2218062 | 4.0 | 0 | 12 | 12 |
| TotalPoints | 77.440083 | 13.5884622 | 78.0 | 36 | 125 | 89 |
| Win | 0.500000 | 0.5001723 | 0.5 | 0 | 1 | 1 |
| Season* | 2.603306 | 1.1111564 | 3.0 | 1 | 4 | 3 |
Plot Matrix of Sets Dataset.
Below are summary statistics of the Shots dataset (features and associated descriptions are given in Table 3 in section 1.3). On average, teams take more guarded shots than unguarded shots, and more long jump shots than short or medium jump shots. The average FG% across all teams and all 1488 games in the dataset is 27.78/68.31 = 40.67%. Teams attempt about 25 3-pointers per game and make about 8, which gives an average 3FG% of about 31%; 2-pointers have a higher efficiency on average (19.89/43.17 = 46.08%) because they are easier to score.

Total Points are negatively correlated with guarded jump shots, short jump shots and medium jump shots, and positively correlated with long jump shots (3-pointers). The negative correlation with guarded shots is unsurprising, as these are more likely to be missed. It is more interesting that teams whose players take more short and medium jump shots tend to score fewer total points, while teams whose players take more long jump shots tend to score more. This suggests that players with good 3-point shooting efficiency are highly valuable to a team and may be an important factor in a team’s season performance.
| | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| X2FG.Attempts | 43.172865 | 7.9905014 | 43.0 | 19 | 76 | 57 |
| X2FG.Made | 19.894628 | 5.2749932 | 20.0 | 5 | 37 | 32 |
| X2FG.Missed | 23.278237 | 6.1437855 | 23.0 | 6 | 46 | 40 |
| X3FG.Attempts | 25.135675 | 6.5643332 | 25.0 | 8 | 47 | 39 |
| X3FG.Made | 7.883609 | 3.2277794 | 8.0 | 0 | 23 | 23 |
| X3FG.Missed | 17.252066 | 5.0190364 | 17.0 | 4 | 40 | 36 |
| All.Free.Throws | 19.064738 | 7.0358888 | 18.0 | 0 | 44 | 44 |
| FG.Attempts | 68.308540 | 7.8644519 | 68.0 | 40 | 102 | 62 |
| FG.Made | 27.778237 | 5.7206141 | 28.0 | 12 | 51 | 39 |
| FG.Missed | 40.530303 | 7.0999397 | 40.0 | 16 | 68 | 52 |
| Guarded.Jump.Shots | 12.511708 | 5.5112691 | 12.0 | 1 | 31 | 30 |
| Live.Free.Throws | 10.068870 | 3.6851550 | 10.0 | 0 | 23 | 23 |
| Long.Jump.Shots..3.point.shots. | 25.351240 | 6.5999639 | 25.0 | 8 | 48 | 40 |
| Medium.Jump.Shots..17..to..3.point.line. | 4.294766 | 2.8733717 | 4.0 | 0 | 19 | 19 |
| Short.Jump.Shots…17.. | 4.687328 | 2.9039348 | 4.0 | 0 | 16 | 16 |
| Total.Points | 77.440083 | 13.5884622 | 78.0 | 36 | 125 | 89 |
| Unguarded.Jump.Shots | 8.913223 | 4.8063152 | 8.0 | 0 | 27 | 27 |
| Win | 0.500000 | 0.5001723 | 0.5 | 0 | 1 | 1 |
| Season* | 2.603306 | 1.1111564 | 3.0 | 1 | 4 | 3 |
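The shooting percentages discussed before the table can be checked directly from the per-game means in the table. The sketch below takes the ratio of the mean makes to the mean attempts (which is not the same as averaging each game's percentage); the dictionary layout and the function name `shooting_pct` are illustrative.

```python
# Per-game means copied from the Shots summary table: (made, attempted).
means = {
    "FG": (27.778237, 68.308540),
    "2FG": (19.894628, 43.172865),
    "3FG": (7.883609, 25.135675),
}


def shooting_pct(made, attempted):
    """Return made / attempted expressed as a percentage."""
    return 100.0 * made / attempted


percentages = {name: shooting_pct(m, a) for name, (m, a) in means.items()}
```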
Plot Matrix for Shots Dataset
Comparing Shot Types vs. Wins(1) or Losses(0)
From the above figure, we can see that a higher number of unguarded shots (iii) is more strongly associated with wins than a higher number of guarded shots (iv). We can also see that taking fewer medium jump shots (vi) is associated with more wins, in contrast to the other types of shots (v & vii).
Below are summary statistics of the Transitions dataset (features and associated descriptions are given in Table 4 in section 1.4). Total Points is most positively correlated with Transition Offense (0.36); Transition Offense occurs when a team gains possession of the ball and quickly pushes it towards the opposing team’s basket. Total Points is most negatively correlated with Press Offense. Press Offense is when the offense (the team in possession of the ball) is being pressed by the other team, i.e. the defense pressures its opponents throughout the court and not just near its own basket. Being pressured makes it harder to score, which is why Press Offense is the most negatively correlated with points. The outliers (shown in the boxplots) are all in the upper tails and may be due to the large variance in the pace of games. For example, a team may have a higher Transition Offense rate when the pace of the game is fast, but if the pace is slow, it may not transition from defense to offense as often. The outliers should not be removed from the dataset, since they are not measurement errors and they provide useful information about games that deviate largely from the average.
Plot Matrix of the Transitions Dataset
Distribution of Variables; Away vs. Home
The distributions of the home variables and the away variables are very similar; however, there is a slight difference in Field Goal Percentage.
| | Average | Statistic |
|---|---|---|
| Away | 0.4092 | FG% |
| Home | 0.4228 | FG% |
| Away | 0.3123 | 3FG% |
| Home | 0.3281 | 3FG% |
There is a very slight difference between the home and away field goal percentages but does this mean that there is a home court advantage?
| Home Wins | Away Wins |
|---|---|
| 402 | 328 |
This shows there is a difference between the number of times a home team wins compared to an away team.
Risk Ratio (RR), or Relative Risk, is a measurement often used in epidemiology to estimate the strength of the association between an exposure and an outcome. In our case we will use it to see whether there is a statistically significant difference between teams playing at home and teams playing away. A risk ratio of 1 means there is no difference, greater than 1 means there is a higher chance of winning if the team is playing at home, and less than 1 means the opposite [4]. An Odds Ratio (OR) is a ratio of odds. It also quantifies the strength of the association between two events. If the odds ratio equals 1, the odds of the first event are the same whether or not the second event occurs. If the odds ratio is greater than 1, the events are positively associated: the presence of the second event raises the odds of the first, and symmetrically the presence of the first raises the odds of the second. In our case we will obtain both measurements to see the strength of association between playing at home and winning.
2 by 2 table analysis:
------------------------------------------------------
Outcome : Win
Comparing : Home vs. Away
Win Lose P(Win) 95% conf. interval
Home 402 328 0.5507 0.5144 0.5864
Away 328 402 0.4493 0.4136 0.4856
95% conf. interval
Relative Risk: 1.2256 1.1049 1.3595
Sample Odds Ratio: 1.5021 1.2222 1.8462
Conditional MLE Odds Ratio: 1.5017 1.2156 1.8562
Probability difference: 0.1014 0.0501 0.1519
Exact P-value: 0.0001
Asymptotic P-value: 0.0001
------------------------------------------------------
The probability of winning at home is 55%, whereas the probability of winning away is 45%. The Sample Odds Ratio tells us that the odds of a team winning are 1.5 times as high when playing at home as when playing away. The Relative Risk tells us that home teams have 1.23 times the ‘risk’ of winning compared to away teams.
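The Relative Risk and Sample Odds Ratio reported above, together with their 95% confidence intervals, can be reproduced from the four cell counts. The sketch below assumes the usual Wald intervals on the log scale with z = 1.96; the function name `two_by_two` is hypothetical, and the conditional MLE odds ratio and exact p-value from the output are not reproduced.

```python
import math


def two_by_two(a, b, c, d, z=1.96):
    """Risk ratio and odds ratio with Wald intervals on the log scale.

    a, b: wins and losses in the first group (home)
    c, d: wins and losses in the second group (away)
    """
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    def interval(estimate, se):
        centre = math.log(estimate)
        return (math.exp(centre - z * se), math.exp(centre + z * se))

    return {
        "rr": rr,
        "rr_ci": interval(rr, se_log_rr),
        "or": odds_ratio,
        "or_ci": interval(odds_ratio, se_log_or),
    }


# Counts from the 2 by 2 table: home 402 wins / 328 losses, away 328 / 402.
result = two_by_two(402, 328, 328, 402)
```

Neither interval contains 1, which is consistent with the small p-values in the output.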
A coach may be more interested in which teams in particular play better at home, and how much better they play.
Difference of Home Statistics vs. Away Statistics of the 2018-2019 Season for each team.
Above is a table of every team from the 2018-2019 season in which the Away statistics are subtracted from the Home statistics, i.e. each team’s statistics when playing at home minus its statistics when playing away. A positive number indicates that the team performed better at home (except for turnovers). For example, Carleton shot their free throws 8.91% higher at home.
The top 3 teams that shot their free throws better at home are Western (12.32%), Carleton (8.91%), and Lakehead (5.58%). The top 3 teams that shot field goals better at home are Ottawa (6.44%), Toronto (5.59%), and Windsor (4.88%). The top 3 teams that shot 3-pointers better at home are Ottawa (11.20%), Laurentian (8.18%), and Nipissing (5.69%). The top 3 teams that turned the ball over the least when playing at home are Algoma (-3.17), Western (-2.75), and Laurentian (-2.65). The top 3 teams that rebounded the ball more at home are Ryerson (10.64), Brock (9.08), and Laurentian (6.11). The top 3 teams that scored more points at home are Ottawa (12.11), Toronto (10.89), and Laurentian (9.74). On average, the teams turned over the ball 6 less times at home,
In conclusion, many teams benefit from playing at home, and different teams excel in different ways. According to a Bleacher Report study [5], referee bias and the psychological impact of playing at home are two of the biggest reasons why there is a large difference between home and away statistics. Studies have shown that when a crowd is vocal, it affects the way referees call a game. Also, referees have historically favored home teams. In addition, the psychological impact of playing at home is a self-sustaining placebo effect: home-court advantage gives the home team an edge simply because players believe that it does.
Wins per Season for all teams in the OUA division
The above shows that Brock, Carleton, Laurentian, UofT, and Western all steadily improved and peaked in the 2017-2018 season. The Ryerson Rams stayed consistent and peaked in the 2018-2019 season. A few teams, such as Algoma, Nipissing and York, consistently win no more than 10 games a season.
The table below gives the correlations between different Play Types and Total Points scored in a game. A negative number represents a negative correlation between the two features, while a positive number represents a positive correlation. A correlation closer to 0 represents a weaker linear relationship, while a correlation further from 0 represents a stronger one.
| Play Type | Correlation to Total Points |
|---|---|
| All Isolation | -0.046532124 |
| All Offensive Rebounds | 0.154450763 |
| All PR Ball Handler | 0.042544034 |
| All Post-Up | -0.040890549 |
| Cuts | 0.227413916 |
| Handoffs | -0.017105738 |
| Isolation Defense Commits | -0.045338194 |
| Isolation Single Covered | -0.032922237 |
| Miscellaneous Possessions | -0.075670507 |
| OffScreens | -0.135119217 |
| Offensive Rebound Putback | 0.154161317 |
| Offensive Rebound Reset Offense | 0.067340925 |
| PR Ball Handler Defense Commits | 0.053122283 |
| PR Ball Handler Single Covered | 0.013881865 |
| PR Ball Handler Traps | -0.020176743 |
| PR Roll Man | 0.101940646 |
| Post Up Defense Commits | -0.056702755 |
| Post Up Hard Double Team | -0.090199983 |
| Post Up Single Covered | 0.013534056 |
| Spot Ups | -0.007428537 |
| Transitions | 0.317687812 |
The plays that are most positively correlated to total points are transitions, cuts, and offensive rebounds. This could mean that transitions, cuts and offensive rebounds contribute to the most points compared to all other plays. The play that is most negatively correlated to total points is offscreens.
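For reference, correlations like those in the table above can be computed as in the following sketch of the Pearson coefficient. It is an assumption that Pearson correlation was the measure used; the function name and the per-game values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-game values, for illustration only.
transitions = [12, 18, 25, 15, 20]
total_points = [70, 80, 95, 72, 83]
r = pearson(transitions, total_points)
```

The coefficient always lies between -1 and 1, and it only measures linear association, so a value near 0 does not rule out a non-linear relationship.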
To account for outliers, and because some teams have played a game or two more than others, the dataset was transformed by averaging each statistic per game within each team and season and summing the wins.
| Features | Correlation to Number of Wins |
|---|---|
| Press Offense | -0.175526612 |
| Push Ball From Shot Attempt | 0.091290517 |
| Push Ball From Turnover | 0.484877383 |
| Push Ball to Half Court | 0.115357730 |
| Free Throws | -0.092037012 |
| Guarded Jump Shots | -0.153775302 |
| Unguarded Jump Shots | 0.592888473 |
| Long Jump Shots | 0.331422037 |
| Medium Jump Shots | -0.365731181 |
| Short Jump Shots | -0.197920494 |
| Cuts | 0.373973357 |
| Handoffs | -0.072249207 |
| Isolation Single Covered | -0.354371026 |
| Isolation Defense Commits | -0.006820286 |
| Miscellaneous Possessions | -0.406512329 |
| OffScreens | -0.280720057 |
| Offensive Rebound PutBack | 0.166641468 |
| Offensive Rebound Reset | 0.368757978 |
| PR Ball Handler Defense Commit | 0.338107184 |
| PR Ball Handler Single Covered | 0.072595284 |
| PR Ball Handler Traps | 0.115888712 |
| PR Roll Man | 0.311420152 |
| Post Up Defense Commits | -0.171605486 |
| Post Up Hard Double Team | -0.067993980 |
| Post Up Single Covered | -0.183923948 |
| Spot Ups | 0.247910273 |
| Transitions | 0.231836305 |
| Assists | 0.653151380 |
| Blocks | 0.337827582 |
| Steals | 0.490803284 |
| Total Rebounds | 0.447741075 |
| Turnovers | -0.319146688 |
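The transformation described before this table (per-game averages within each team and season, with wins summed) could look like the sketch below. The rows, team labels, and field names are hypothetical and only illustrate the grouping step.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical game-level rows; team labels and field names are illustrative.
games = [
    {"team": "Team A", "season": "2018/2019", "cuts": 8, "win": 1},
    {"team": "Team A", "season": "2018/2019", "cuts": 6, "win": 1},
    {"team": "Team B", "season": "2018/2019", "cuts": 5, "win": 0},
    {"team": "Team B", "season": "2018/2019", "cuts": 9, "win": 1},
]


def aggregate(rows, features):
    """Average each feature and sum the wins per (team, season) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["team"], row["season"])].append(row)
    summary = {}
    for key, group in groups.items():
        entry = {f: fmean([r[f] for r in group]) for f in features}
        entry["wins"] = sum(r["win"] for r in group)
        summary[key] = entry
    return summary


team_seasons = aggregate(games, ["cuts"])
```

Correlating each averaged feature with `wins` across the resulting team-season rows would then give a table of the kind shown above.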
Carleton has been a strong team for over a decade. They have won 14 of the last 17 U Sports national championships (2003-2007, 2009, 2011-2017, 2019) thanks to remarkable coaching and a great roster.
The Ravens’ head coach Dave Smart has had a lot to do with the team’s success, having led them to 13 of their 14 championships between 1999 (his first year as coach) and 2019. From 2003 to 2007 inclusive, Smart led the Ravens to five consecutive Canadian Interuniversity Sport (CIS) national championships, which were the first CIS championships won by Carleton in any sport [6]. He has been head coach for all seasons since 1999 except for 2015-2016, during which he was on sabbatical leave. Below we analyze data from 2015 to 2019, which suggests that his absence affected the team’s performance.
Carleton Wins Per Season.
The Ravens won the fewest regular season games in the 2015-2016 season, when Dave Smart took a sabbatical (his nephew, Rob Smart, was the interim coach). The team had also lost four starters from the previous year’s championship team. It is evident that these two factors greatly affected the Ravens’ performance throughout that season.
Carleton Play Types Proportionate to Total Possessions Comparison Per Season
Barplot (i) illustrates that the 2015/2016 season had significantly more Post-Up plays. This could be a preference of coach Rob Smart due to a larger roster, or it could be his preferred style of play in general. The 2015/2016 season had the fewest isolation plays (ii), but the change from the 2016/2017 season is not significant. The 2015/2016 season had the most transition turnovers (iv) (transition plays resulting in a turnover), along with the 2016-2017 season.
A test of equal or given proportions can be used to test the null hypothesis that the proportions in several groups are the same, or that they equal certain given values. It can be used to see whether the 2015-16 season differs significantly from the other seasons for the above play types.
4-sample test for equality of proportions without continuity
correction
data: carleton$AllPostUp out of carleton$AllPossessionClips
X-squared = 86.565, df = 3, p-value < 2.2e-16
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.12204951 0.04860267 0.05393996 0.07446809
4-sample test for equality of proportions without continuity
correction
data: carleton$AllIsolation out of carleton$AllPossessionClips
X-squared = 24.15, df = 3, p-value = 2.324e-05
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.07369027 0.08505468 0.09193246 0.11798839
4-sample test for equality of proportions without continuity
correction
data: carleton$Transitions out of carleton$AllPossessionClips
X-squared = 15.781, df = 3, p-value = 0.001257
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.2020725 0.2381531 0.2049719 0.1856867
4-sample test for equality of proportions without continuity
correction
data: carleton$TransitionTurnover out of carleton$AllPossessionClips
X-squared = 13.696, df = 3, p-value = 0.003349
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.03799655 0.03766707 0.02861163 0.02030948
All the p-values are less than 0.05, which tells us that the proportion is not the same for each season, i.e. at least one season is different from the others. However, this does not mean that the 2015-16 season is the one that is statistically different from the rest. Only for Post-Up plays is the 2015-16 season statistically different from all the others. This is also the largest difference, with an X-squared value of 86.565. For Isolation, it is the 2018-19 season that is statistically different from the rest.
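The outputs above are in the format printed by R's `prop.test`. As a minimal sketch of the same calculation, the Pearson chi-squared statistic for k groups (without continuity correction) can be computed by hand. The function names and the counts below are hypothetical and chosen only for illustration; they are not the Carleton data.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: an erfc term plus a finite sum over half-integer powers.
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total

def equal_proportions_test(successes, totals):
    """Test that k groups share one proportion; returns (statistic, df, p-value).

    Assumes the pooled proportion is strictly between 0 and 1.
    """
    pooled = sum(successes) / sum(totals)
    stat = 0.0
    for s, n in zip(successes, totals):
        # Each group contributes a "success" cell and a "failure" cell.
        stat += (s - n * pooled) ** 2 / (n * pooled)
        stat += ((n - s) - n * (1 - pooled)) ** 2 / (n * (1 - pooled))
    df = len(totals) - 1
    return stat, df, chi2_sf(stat, df)

# Hypothetical play counts and possession totals for four seasons.
plays = [120, 50, 55, 75]
possessions = [1000, 1000, 1000, 1000]
stat, df, p = equal_proportions_test(plays, possessions)
print(f"X-squared = {stat:.3f}, df = {df}, p-value = {p:.3g}")
```

Calling `chi2_sf(15.781, 3)` with the statistic reported for Transitions gives a tail probability of about 0.00126, in line with the p-value printed above.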
Carleton Pace Per Season
This barplot illustrates that the 2015/2016 Ravens pushed the ball least often and, as a result, played at a slower pace on average.
4-sample test for equality of proportions without continuity
correction
data: carleton$AllPushBall out of carleton$AllPossessionClips
X-squared = 13.444, df = 3, p-value = 0.003768
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.3235463 0.3791009 0.3686679 0.3520309
The p-value is less than 0.05, which tells us that the proportion is not the same for each season, i.e. at least one season is different from the others. Together with the sample estimates, in which the 2015-16 season has the lowest proportion (0.32), this suggests that the 2015-16 team pushed the ball significantly less often than in the other seasons.
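Because the k-sample test only says that at least one season differs, a follow-up comparison is needed to say which one. The sketch below shows one common option, pairwise two-proportion z-tests with a Bonferroni adjustment. This is an assumed follow-up procedure for illustration, not necessarily the one used in this analysis, and the counts are hypothetical.

```python
import math
from itertools import combinations

def two_proportion_p_value(s1, n1, s2, n2):
    """Two-sided p-value of the pooled z-test for equality of two proportions."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (s1 / n1 - s2 / n2) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def pairwise_bonferroni(successes, totals, labels):
    """Bonferroni-adjusted p-values for every pair of groups."""
    pairs = list(combinations(range(len(labels)), 2))
    adjusted = {}
    for i, j in pairs:
        p = two_proportion_p_value(successes[i], totals[i], successes[j], totals[j])
        adjusted[(labels[i], labels[j])] = min(1.0, p * len(pairs))
    return adjusted

# Hypothetical push-ball counts out of 1000 possessions per season.
seasons = ["2015-16", "2016-17", "2017-18", "2018-19"]
pushes = [320, 380, 370, 350]
possessions = [1000, 1000, 1000, 1000]
for pair, p in pairwise_bonferroni(pushes, possessions, seasons).items():
    print(pair, f"adjusted p = {p:.4f}")
```

A season can then be described as different from all the others only if every pair that includes it has an adjusted p-value below the chosen threshold.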
Carleton Shot Types Per Season
The 2015/2016 Ravens had the fewest 3-pointers and significantly more medium jump shots (jump shots from 17 ft to just inside the 3-point line).
4-sample test for equality of proportions without continuity
correction
data: carleton$LongJumpShots out of carleton$AllPossessionClips
X-squared = 15.228, df = 3, p-value = 0.001632
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.2521589 0.3110571 0.2922139 0.2843327
The p-value is less than 0.05, which tells us that the proportion is not the same for each season, i.e. at least one season is different from the others. Together with the sample estimates, this suggests that the 2015-16 team attempted 3-pointers significantly less often than in the other seasons.
4-sample test for equality of proportions without continuity
correction
data: carleton$MediumJumpShots out of carleton$AllPossessionClips
X-squared = 32.642, df = 3, p-value = 3.831e-07
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.03742084 0.02551640 0.01923077 0.01063830
The p-value is less than 0.05, which tells us that the proportion is not the same for each season, i.e. at least one season is different from the others. Together with the sample estimates, this suggests that the 2015-16 team took significantly more medium jump shots than in the other seasons.
Carleton Bench Points Per Season
This barplot shows that the 2015/2016 Ravens had the fewest bench points, which could suggest that the coach did not use his bench players as much. It could also reflect the skill of the bench players during that season compared to the other seasons.
4-sample test for equality of proportions without continuity
correction
data: carleton.reg$BenchPoints out of carleton.reg$TotalPoints
X-squared = 84.589, df = 3, p-value < 2.2e-16
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.3120049 0.4181922 0.4071709 0.4578772
The p-value is less than 0.05, which tells us that the proportion is not the same for each season, i.e. at least one season is different from the others. Together with the sample estimates, this suggests that bench points made up a significantly smaller share of total points in the 2015-16 season than in the other seasons. This could be due to the coach preferring to play his starters for most of the game.
Carleton is a very successful team: of the 89 games they played from the 2015-16 season to the 2018-19 season, they lost only 7. Using a decision tree, we can examine the conditions under which Carleton lost their games. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences; in this case the consequence is a game loss. The paths from root to leaf represent classification rules [7].
Carleton Classification Tree for Wins using Play Type Data on all Seasons vs Classification Tree for Wins using Play Type Data excluding 2015-2016 Season
Using the tree on the left, we can see that Carleton has lost games when they ran more than 17 Post-Up plays, or fewer than 17 Post-Up plays but with few Transition plays and few Cuts. They have also lost when they ran a high number of Post-Up plays but few Pick and Roll plays. If we exclude the 2015-2016 season, the classification tree has fewer branches. Carleton has lost games when they made 15 Pick and Roll Ball Handler plays, and then fewer than six Post-Up plays.
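Each node in a classification tree is a threshold on one variable, chosen so that the two resulting groups are as pure as possible. The sketch below shows how one such threshold could be found using Gini impurity, a common splitting criterion. The criterion actually used to grow the trees in the figures is not stated here, so treat it as an assumption; the game data is hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means all labels are equal)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best binary split on one feature.

    The threshold is None if no split improves on the unsplit impurity.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2  # candidate cut halfway between neighbouring values
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical per-game Post-Up counts and outcomes (W = win, L = loss).
post_ups = [8, 10, 12, 14, 15, 20, 22]
outcomes = ["W", "W", "W", "W", "W", "L", "L"]
threshold, impurity = best_split(post_ups, outcomes)
print(f"rule: Post-Ups >= {threshold} -> weighted impurity {impurity:.3f}")
```

A full tree repeats this search over every variable at every node, which is how rules that combine Post-Up, Transition, and Cut counts arise.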
Carleton Classification Tree for Wins using Shots Data
On the left tree, we see that the Ravens lost when they made a low number of 3-pointers and also when they took a lot of guarded jump shots (this considers all the seasons from 2015 to 2019). When we exclude the 2015-2016 season, the Ravens lost their games when they made fewer than 12 free throws, took more than six short jump shots, and took fewer than one medium jump shot. They won every game in which they took more than 12 free throws. The more fouls Carleton draws, the more likely they are to win.
Carleton Decision Tree
From the left decision tree we see that the Ravens have lost when they ran a lot of Post-Up plays and when they pushed the ball a low number of times (this includes all seasons). If we exclude the 2015-2016 season, the classification tree has fewer branches. The Ravens lost games when they made fewer than four steals and pushed the ball more than 32 times.
There are four statistics that give insight into what is happening in a game: Field Goal Attempts, Adjusted Field Goal Percentage, Free Throw Attempts, and Free Throw Percentage. Adjusted Field Goal Percentage, aFG% = [(Total Points - Free Throws Made)/Field Goal Attempts]/2, is a formula designed to account for the impact of 3-pointers on a player’s overall shooting percentage. If a team beats their opponent in all four statistics, then they will always win [8]. This is because to win you need to either shoot a higher percentage or take more shots than your opponent. The number of shot attempts and free throw attempts are related to other statistics such as turnovers, rebounds, fouls, blocks, and steals. However, assuming the teams are attempting the same number of shots, it would be important for a coach to know which players shoot efficiently from which areas of the court.
Using Synergy’s Multi-Game Shot Chart, we can see the shooting efficiencies for all areas on the court. The shooting efficiencies are represented by the percentage values, while the fractions below these represent the number of shots made over the number of shots attempted. Efficiencies with a higher number of shots attempted (i.e. the denominator) are more worthwhile to analyze because they represent players’ success in areas that they shoot from more frequently.
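As a quick check of the aFG% formula, the sketch below computes it from points and free throws, and again from made shots. The two agree because points minus free throws made equals 2 × (field goals made) + (3-pointers made). The function names and the stat line are hypothetical.

```python
def adjusted_fg_pct(total_points, free_throws_made, field_goal_attempts):
    """aFG% = [(Total Points - Free Throws Made) / Field Goal Attempts] / 2."""
    if field_goal_attempts == 0:
        raise ValueError("aFG% is undefined with zero field goal attempts")
    return (total_points - free_throws_made) / field_goal_attempts / 2

def adjusted_fg_pct_from_makes(field_goals_made, three_pointers_made, field_goal_attempts):
    """Equivalent form: each made 3-pointer counts as 1.5 made field goals."""
    return (field_goals_made + 0.5 * three_pointers_made) / field_goal_attempts

# Hypothetical player: 80 field goals made (20 of them 3-pointers),
# 30 free throws made, 200 field goal attempts.
points = 2 * (80 - 20) + 3 * 20 + 30   # total points from all made shots
print(adjusted_fg_pct(points, 30, 200))
print(adjusted_fg_pct_from_makes(80, 20, 200))
```

The player's plain field goal percentage is 80/200 = 40%, so the gap between that and the aFG% reflects the extra value of the made 3-pointers.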
Carleton’s Overall Shooting Efficiency for all of the 2018-2019 Season.
Carleton is most effective nearest the net, like most teams (if not all). Their most effective 3-point area is the top left, where they shoot 41.7% on average. The team attempted over 120 shots in the middle, right, and left 3-point areas but shot around 10% better in the top left 3-point area. This suggests that the players who shoot from that area should take more shots from there.
The figure below shows a table of the Carleton Players’ Shooting Statistics throughout the 2018-2019 season. Although some players have higher percentage efficiencies than others, it is important to first consider the number of attempts made. For example, J. Louis has a 100% 3FG, but only attempted (and thus made) one shot. Considering only the Field Goal Percentages is therefore not useful, and one must also consider the attempts. To facilitate this, the table has been sorted in descending order of Field Goal Attempts (FGAs). To determine the players with better shooting efficiency, we first look at the top four players with the most FGAs. From this, we can compare their Adjusted Field Goal Percentages relative to each other.
Carleton Players’ Overall Shooting Efficiency for all of the 2018-2019 Season sorted by field goal attempts. Note that “m” denotes “missed”, while “M” denotes “made” and “A” denotes “attempts”.
From the table, different deductions can be made using the aforementioned method. It is clear that some players shoot more while scoring fewer points, and vice versa. Of the four players with the most FGAs, Eddie Ekiyor has the highest aFG% at 66.5%. Relative to the other three players, he took the fewest attempts (200 FGAs), yet scored 319 points. On the other hand, Yasiin Joseph took more attempts (213 FGAs) and scored 227 points. This type of data is very useful, as coaches can use it to improve the shooting of players who already have the confidence to take many Field Goal Attempts but are lacking in shot accuracy.
Carleton Players’ Play Type Shooting Efficiency for all of the 2018-2019 Season sorted by adjusted field goal %.
The plays that result in the most efficient shooting percentages are Pick & Roll-Roll Man, Offensive Rebound - Put Backs, Cuts, Transitions, Pick & Roll-Ball Handler, and Post-Up. Cuts and Transitions have the most shot attempts from the plays mentioned. The plays with the lowest adjusted field goal percentages are Handoffs, Isolation, Miscellaneous, and Off Screens. This may suggest that these plays should not be performed as much as the other more effective plays.
The data was scraped from the OUA website. Every box score for every game in each season was collected and aggregated, giving player statistics for every season from 2014-15 to 2018-19 (the Player Statistics include 2014-15 because they come from the OUA website, which has 2014-15 data).
The goal is to use player statistics to gather insights on how players contribute to the game and how to categorize players using unsupervised learning.
The data is a subset of the player data with filters on the number of games played and the minutes per game. There are dataframes for every season from 2014-15 to 2018-19, containing players who have played at least 15 games and averaged at least 20 minutes per game. The games are regular season games from the U Sports division, Ontario University Athletics conference. All the variables are totals for the season except for PPG (Points per game) and MPG (Minutes per game).
First, the data will be normalized to prepare it for k-means clustering. This is helpful because some statistics have very different ranges, e.g. the number of points compared to the number of steals. After normalization the variables are comparable.
K-Means Clustering is a popular unsupervised machine learning algorithm. The goal of K-Means is to group similar data points in a dataset of unlabeled data. It does this by dividing the data into k clusters, where each observation belongs to the cluster with the nearest mean (cluster centroid) according to a distance metric (most commonly Euclidean distance).
Since K-Means clustering is an unsupervised algorithm, the number of clusters is not known in advance. However, there are techniques that can be used to find an optimal number of clusters, such as the gap method, the silhouette method, the within-cluster sum of squares method, the D-index, etc. A different technique or configuration will be used for each season’s clustering to find the optimal number of clusters.
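The exact normalization is not specified here. A minimal sketch, assuming z-score standardization (subtract the column mean, divide by the sample standard deviation), shows how two statistics on very different scales become comparable. The player totals are hypothetical.

```python
import statistics

def standardize(column):
    """Rescale a list of numbers to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(x - mean) / sd for x in column]

# Hypothetical season totals for five players.
points = [421, 335, 246, 150, 119]
steals = [28, 22, 20, 21, 13]

for name, column in (("Points", points), ("Steals", steals)):
    scaled = standardize(column)
    print(name, [round(v, 2) for v in scaled])
```

After scaling, a one-unit difference means "one standard deviation" for every variable, so Points no longer dominates the Euclidean distances used by k-means.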
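The assign-then-update loop described above can be sketched in a few lines. This is a plain Lloyd-style implementation for illustration, with random initial centroids and hypothetical two-dimensional points; it is not the clustering routine used to produce the results in this report.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_labels == labels and new_centroids == centroids:
            break  # nothing changed, so the algorithm has converged
        labels, centroids = new_labels, new_centroids
    return labels, centroids

# Hypothetical standardized (points, rebounds) pairs forming two obvious groups.
players = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(players, k=2)
print(labels)
print(centroids)
```

The result depends on the starting centroids, which is one reason cluster numbering can differ between runs on the same data.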
The technique that will be used to find the optimal number of clusters for the 2014-2015 season is the D-index method (Lebart et al. 2000). The D-index is based on clustering gain on intra-cluster inertia [8]. Intra-cluster inertia can be defined as:
Intra-cluster Inertia formula.
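A sketch of the definition, assuming the formulation given for the D-index in the documentation of the NbClust R package. The symbols are taken from that formulation rather than from this text: P^q is a partition into q clusters, C_k is cluster k with n_k members and centroid c_k, and d is the distance metric.

```latex
w(P^{q}) = \frac{1}{q} \sum_{k=1}^{q} \frac{1}{n_k} \sum_{x_i \in C_k} d(x_i, c_k),
\qquad
\mathrm{Gain} = w(P^{q-1}) - w(P^{q})
```

Under this assumption, the clustering gain is the drop in intra-cluster inertia when moving from q - 1 to q clusters.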
The clustering gain should be minimized. The optimal cluster configuration can be identified by the sharp knee that corresponds to a significant decrease of the first differences of clustering gain versus the number of clusters. This knee or great jump of gain values can be identified by a significant peak in second differences of clustering gain.
Finding the Optimal Cluster using D-index for the 2014-15 season.
In the plot of the D-index, we look for a significant knee (a significant peak in the D-index second differences plot), which corresponds to a significant increase in the value of the measure. Here the knee occurs around 8, so the method suggests 8 clusters.
Cluster Plot for 2014-15 Season. The axes are the Principal Components where Dim1 is the first PC and Dim2 is the second PC. The first PC explains 35.5% of the variance in the data and the second PC explains 18.5%.
| Statistic | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 |
|---|---|---|---|---|---|---|---|---|
| 3PointMade | 31.47 | 40.14 | 12.77 | 19.57 | 2.25 | 17.25 | 23.50 | 19.47 |
| 3PointAttempted | 81.40 | 112.57 | 37.86 | 64.29 | 8.92 | 46.75 | 74.00 | 56.47 |
| Assists | 38.07 | 56.14 | 22.50 | 47.71 | 22.17 | 36.50 | 59.00 | 35.20 |
| Blocks | 4.33 | 6.29 | 3.64 | 5.86 | 15.25 | 21.25 | 4.50 | 4.13 |
| DefensiveRebounds | 59.60 | 64.86 | 44.18 | 81.86 | 79.83 | 128.75 | 94.00 | 58.00 |
| FieldGoalMade | 84.87 | 118.43 | 42.91 | 87.29 | 78.50 | 132.50 | 149.00 | 50.53 |
| FieldGoalAttempted | 194.00 | 278.14 | 104.55 | 209.29 | 157.50 | 264.75 | 304.50 | 135.27 |
| FreeThrowsMade | 44.80 | 58.57 | 20.86 | 55.71 | 40.50 | 64.25 | 99.50 | 29.40 |
| FreeThrowsAttempted | 59.87 | 73.57 | 32.59 | 75.00 | 62.83 | 94.75 | 137.50 | 40.67 |
| Minutes | 502.93 | 579.86 | 387.45 | 619.14 | 450.75 | 547.25 | 694.00 | 500.93 |
| OffensiveRebounds | 25.67 | 16.86 | 19.00 | 25.86 | 44.50 | 52.50 | 26.50 | 16.67 |
| PersonalFouls | 39.40 | 35.43 | 39.50 | 50.71 | 46.17 | 33.50 | 44.50 | 44.93 |
| Points | 246.00 | 335.57 | 119.45 | 249.86 | 199.75 | 346.50 | 421.00 | 149.93 |
| Rebounds | 85.27 | 81.71 | 63.18 | 107.71 | 124.33 | 181.25 | 120.50 | 74.67 |
| Steals | 19.60 | 21.86 | 13.00 | 23.57 | 12.50 | 16.75 | 27.50 | 20.53 |
| Turnovers | 34.47 | 40.14 | 28.55 | 48.43 | 30.50 | 47.25 | 57.00 | 34.27 |
| Home | 9.53 | 9.00 | 9.27 | 9.14 | 9.50 | 9.50 | 10.50 | 9.47 |
| GamesPlayed | 18.73 | 18.29 | 18.00 | 19.00 | 18.92 | 18.75 | 20.00 | 18.93 |
| PointsPerGame | 13.15 | 18.39 | 6.64 | 13.19 | 10.59 | 18.47 | 21.05 | 7.95 |
| MinutesPerGame | 26.85 | 31.74 | 21.53 | 32.60 | 23.81 | 29.20 | 34.70 | 26.53 |
| 3P% | 0.37 | 0.36 | 0.30 | 0.29 | 0.18 | 0.37 | 0.32 | 0.38 |
| FG% | 0.44 | 0.43 | 0.42 | 0.42 | 0.50 | 0.50 | 0.49 | 0.38 |
| FT% | 0.74 | 0.79 | 0.66 | 0.76 | 0.64 | 0.70 | 0.72 | 0.72 |
| TrueShooting% | 0.56 | 0.54 | 0.50 | 0.52 | 0.54 | 0.56 | 0.57 | 0.49 |
| Player | Team | Cluster |
|---|---|---|
| 15-Zachary Angelini | Brock | 1 |
| 10-Connor Wood | Carleton | 1 |
| 23-Philip Scrubb | Carleton | 1 |
| 23-Aaron Redpath | McMaster | 1 |
| 32-Joe Rocca | McMaster | 1 |
| 06-Caleb Agada | Ottawa | 1 |
| 12-Aaron Best | Ryerson | 1 |
| 21-Adika Peter-McNeilly | Ryerson | 1 |
| 9-M Sahota | Toronto | 1 |
| 22-Anthony Spiridis | Western | 1 |
| 06-Mitch Farrell | Windsor | 1 |
| 09-Alex Campbell | Windsor | 1 |
| 21-Khalid Abdel-Gabar | Windsor | 1 |
| 3-Richard Iheadindu | York | 1 |
| 8-Nathan Culbreath | York | 1 |
| 11-Johneil Simpson | Brock | 2 |
| 14-Ryan Bennett | Laurentian | 2 |
| 06-Will Coulthard | Laurier | 2 |
| 08-Johnny Berhanemeskel | Ottawa | 2 |
| 21-Greg Faulkner | Queen’s | 2 |
| 7-Jahmal Jones | Ryerson | 2 |
| 22-J Clarke | Toronto | 2 |
| 2-Jamal Mucket-Sobers | Algoma | 3 |
| 3-AJ Andre Barder | Algoma | 3 |
| 4-Thomas Chalmers | Algoma | 3 |
| 6-Adam Benrabah | Algoma | 3 |
| 13-J.e Pierre-Charles | Carleton | 3 |
| 05-Jonathan Wallace | Guelph | 3 |
| 12-Michel Clark | Guelph | 3 |
| 23-Jamar Coke | Lakehead | 3 |
| 09-Luke Allin | Laurier | 3 |
| 4-Joey Puddister | Nipissing | 3 |
| 5-Marvin Ngonadi | Nipissing | 3 |
| 7-Jerron Rhodes | Nipissing | 3 |
| 01-Vikas Gill | Ottawa | 3 |
| 05-Mehdi Tihani | Ottawa | 3 |
| 09-Matt Plunkett | Ottawa | 3 |
| 10-Cy Samuels | Queen’s | 3 |
| 20-Ryall Stroud | Queen’s | 3 |
| 8-D Ankrah | Toronto | 3 |
| 07-Jedson Tavernier | Western | 3 |
| 10-Nidun Chandrakumar | York | 3 |
| 4-Phillip Cunningham-Gillen | York | 3 |
| 5-Gene Spagnuolo | York | 3 |
| 33-Matt Marshall | Brock | 4 |
| 04-Daniel Dooley | Guelph | 4 |
| 08-Dwayne Harvey | Lakehead | 4 |
| 15-Tychon Carter-Newman | Laurentian | 4 |
| 44-Sam Hirst | Laurentian | 4 |
| 5-Jaspreet Gill | Waterloo | 4 |
| 08-Quinn Henderson | Western | 4 |
| 31-Guillaume Boucard | Carleton | 5 |
| 21-Trevor Thompson | Guelph | 5 |
| 22-Anthony McIntosh | Lakehead | 5 |
| 24-Bacarius Dinkins | Lakehead | 5 |
| 15-Aiddian Walters | Laurier | 5 |
| 20-Kyrie Coleman | Laurier | 5 |
| 10-Taylor Black | McMaster | 5 |
| 22-Rohan Boney | McMaster | 5 |
| 23-Marcos Clennon | Nipissing | 5 |
| 04-Gabriel Gonthier-Dubue | Ottawa | 5 |
| 15-Kadeem Green | Ryerson | 5 |
| 07-Evan Matthews | Windsor | 5 |
| 11-Thomas Scrubb | Carleton | 6 |
| 7-D Johnson | Toronto | 6 |
| 12-Rotimi Osuntola | Windsor | 6 |
| 22-Nick Tufegdzich | York | 6 |
| 6-Myles Charvis | Waterloo | 7 |
| 12-Greg Morrow | Western | 7 |
| 10-Sean Clendinning | Algoma | 8 |
| 5-Brett Zufelt | Algoma | 8 |
| 03-Gavin Resch | Carleton | 8 |
| 21-Alex Robichaud | Lakehead | 8 |
| 11-David Aromolaran | Laurentian | 8 |
| 02-James Agyeman | Laurier | 8 |
| 03-Garrison Thomas | Laurier | 8 |
| 25-Adam Presutti | McMaster | 8 |
| 6-Jordon Campbell | Nipissing | 8 |
| 12-Tanner Graham | Queen’s | 8 |
| 5-S Usher | Toronto | 8 |
| 3-Jon Ravenhorst | Waterloo | 8 |
| 7-Ben Davis | Waterloo | 8 |
| 05-Tom Filgiano | Western | 8 |
| 10-Mike Rocca | Windsor | 8 |
Each cluster can be categorized as a type of player.
Cluster 1: Efficient Playmakers & Scorers This cluster of players has the most assists and the second most points per game. They have a big defensive impact through the number of steals they get, and they can control the tempo well and score.
Cluster 2: All-Around Players These players can get rebounds, pass and score well.
Cluster 3: Dominant Big Men These players are the most dominant big men in the league with the most rebounds (defensive and offensive), blocks, and points.
Cluster 4: Smart Catch & Shoot Players These players make the best decisions and turn the ball over the least. They do not dribble the ball much and are the most efficient shooters.
Cluster 5: Aggressive Defenders These players are aggressive and foul the most out of all the other clusters. They have a bigger impact on defense since they do not shoot well.
Cluster 6: Role Players These players contribute to many plays and work both offensively and defensively.
Cluster 7: Second Tier Playmakers These players are less dominant playmakers that can still score efficiently.
Cluster 8: Second Tier Small Players These players play small but do not shoot as efficiently as the other players or create as many plays.
For the 2015-16 season, the elbow method [9] will be used to find the optimal number of clusters. The elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters so that adding another cluster does not give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. This “elbow” cannot always be unambiguously identified.
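The quantity plotted in an elbow chart can be sketched as one minus the ratio of the within-cluster sum of squares to the total sum of squares. The function name, the one-dimensional data, and the candidate labelings below are hypothetical; in practice the labels for each k would come from a k-means fit.

```python
def variance_explained(points, labels):
    """Share of total variance explained by a clustering: 1 - WSS / TSS."""
    n, dims = len(points), len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dims)]
    tss = sum((p[d] - grand[d]) ** 2 for p in points for d in range(dims))
    wss = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centre = [sum(m[d] for m in members) / len(members) for d in range(dims)]
        wss += sum((m[d] - centre[d]) ** 2 for m in members for d in range(dims))
    return 1.0 - wss / tss

# Hypothetical observations and one candidate labeling per number of clusters.
data = [(0,), (2,), (10,), (12,)]
candidate_labels = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
    4: [0, 1, 2, 3],
}
for k, labels in candidate_labels.items():
    print(k, round(variance_explained(data, labels), 3))
```

In this toy example nearly all of the gain comes from going from one cluster to two, and later clusters add very little, which is the "elbow" shape described above.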
Elbow Method for finding the optimal number of clusters for the 2015-16 season
The number of clusters used for this season is 6. Below are tables showing the average statistics for each cluster and which players belong to which cluster.
Cluster Plot of 2015-16 Season
| Statistic | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|---|
| 3PointMade | 14.85 | 22.62 | 18.75 | 13.16 | 31.75 | 44.57 |
| 3PointAttempted | 48.15 | 67.88 | 55.75 | 41.84 | 86.00 | 133.29 |
| Assists | 39.38 | 57.12 | 43.25 | 24.91 | 40.62 | 52.71 |
| Blocks | 7.85 | 4.75 | 13.50 | 7.41 | 6.44 | 3.29 |
| DefensiveRebounds | 72.00 | 72.50 | 111.25 | 51.62 | 63.56 | 68.71 |
| FieldGoalMade | 101.38 | 56.25 | 150.50 | 48.56 | 85.62 | 111.14 |
| FieldGoalAttempted | 236.23 | 150.62 | 317.75 | 119.72 | 196.88 | 290.00 |
| FreeThrowsMade | 59.38 | 28.75 | 99.00 | 23.69 | 32.88 | 55.14 |
| FreeThrowsAttempted | 83.46 | 40.12 | 130.25 | 34.47 | 43.44 | 74.00 |
| Minutes | 564.00 | 620.00 | 631.50 | 419.31 | 490.81 | 639.14 |
| OffensiveRebounds | 30.77 | 25.75 | 44.50 | 21.50 | 18.19 | 14.57 |
| PersonalFouls | 44.92 | 48.88 | 47.25 | 39.56 | 38.44 | 35.57 |
| Points | 277.00 | 163.88 | 418.75 | 133.97 | 235.88 | 322.00 |
| Rebounds | 102.77 | 98.25 | 155.75 | 73.12 | 81.75 | 83.29 |
| Steals | 23.85 | 25.00 | 30.25 | 15.72 | 18.88 | 24.14 |
| Turnovers | 49.46 | 40.50 | 58.00 | 26.00 | 34.50 | 49.57 |
| Home | 9.69 | 9.62 | 9.75 | 9.31 | 9.25 | 9.71 |
| GamesPlayed | 19.08 | 19.38 | 19.00 | 18.38 | 18.25 | 19.43 |
| PointsPerGame | 14.58 | 8.47 | 22.02 | 7.34 | 12.99 | 16.58 |
| MinutesPerGame | 29.60 | 31.98 | 33.18 | 22.87 | 26.96 | 32.90 |
| 3P% | 0.26 | 0.32 | 0.31 | 0.28 | 0.38 | 0.33 |
| FG% | 0.43 | 0.37 | 0.47 | 0.41 | 0.44 | 0.39 |
| FT% | 0.72 | 0.72 | 0.76 | 0.71 | 0.77 | 0.76 |
| TrueShooting% | 0.50 | 0.48 | 0.55 | 0.49 | 0.54 | 0.50 |
| Player | Team | Cluster |
|---|---|---|
| 10-Sean Clendinning | Algoma | 1 |
| 24-Bacarius Dinkins | Lakehead | 1 |
| 11-David Aromolaran | Laurentian | 1 |
| 44-Sam Hirst | Laurentian | 1 |
| 12-Matt Chesson | Laurier | 1 |
| 03-Leon Alexander | McMaster | 1 |
| 23-Aaron Redpath | McMaster | 1 |
| 13-Marcus Lewis | Nipissing | 1 |
| 21-Adika Peter-McNeilly | Ryerson | 1 |
| 07-Ben Davis | Waterloo | 1 |
| 10-Peter Scholtes | Western | 1 |
| 22-Anthony Spiridis | Western | 1 |
| 08-Nathan Culbreath | York | 1 |
| 33-Matt Marshall | Brock | 2 |
| 03-Nick Burke | Lakehead | 2 |
| 21-Alexandre Robichaud | Lakehead | 2 |
| 02-Simon Mikre | Laurier | 2 |
| 04-Joey Puddister | Nipissing | 2 |
| 06-Dylan Phillips | Waterloo | 2 |
| 05-Tom Filgiano | Western | 2 |
| 10-Mike Rocca | Windsor | 2 |
| 13-Dani Elgadi | Brock | 3 |
| 06-Devin Johnson | Toronto | 3 |
| 12-Greg Morrow | Western | 3 |
| 09-Alex Campbell | Windsor | 3 |
| 05-Brett Zufelt | Algoma | 4 |
| 06-Nathan Riley | Algoma | 4 |
| 13-Reng Gum | Algoma | 4 |
| 25-Tyler Brown | Brock | 4 |
| 15-Drew Walford | Guelph | 4 |
| 31-Jack Beatty | Guelph | 4 |
| 10-Nick Simon | Laurentian | 4 |
| 32-Joseph Sykes | Laurentian | 4 |
| 03-Garrison Thomas | Laurier | 4 |
| 04-Trevon McNeil | McMaster | 4 |
| 21-David McCulloch | McMaster | 4 |
| 22-Rohan Boney | McMaster | 4 |
| 05-Marvin Ngonadi | Nipissing | 4 |
| 07-Jerron Rhodes | Nipissing | 4 |
| 09-Kalil Langston | Nipissing | 4 |
| 01-Vikas Gill | Ottawa | 4 |
| 05-Mehdi Tihani | Ottawa | 4 |
| 09-Matt Plunkett | Ottawa | 4 |
| 13-Nathan McCarthy | Ottawa | 4 |
| 13-Sammy Ayisi | Queen’s | 4 |
| 20-Ryall Stroud | Queen’s | 4 |
| 06-Roshane Roberts | Ryerson | 4 |
| 22-Juwon Grannum | Ryerson | 4 |
| 05-Sage Usher | Toronto | 4 |
| 09-Manny Sahota | Toronto | 4 |
| 21-Daniel Johansson | Toronto | 4 |
| 06-Alex Coote | Western | 4 |
| 07-Jedson Tavernier | Western | 4 |
| 15-Micah Kirubel | Windsor | 4 |
| 22-Tyler Persaud | Windsor | 4 |
| 04-Philip Gillen | York | 4 |
| 05-Gene Spagnuolo | York | 4 |
| 03-Andre Barber | Algoma | 5 |
| 14-Ryan Bennett | Brock | 5 |
| 03-Gavin Resch | Carleton | 5 |
| 10-Connor Wood | Carleton | 5 |
| 31-Guillaume Boucard | Carleton | 5 |
| 41-Kaza Kajami-Keane | Carleton | 5 |
| 04-Daniel Dooley | Guelph | 5 |
| 05-Jonathan Wallace | Guelph | 5 |
| 11-Taylor Boers | Guelph | 5 |
| 05-Troy Joseph | McMaster | 5 |
| 12-Tanner Graham | Queen’s | 5 |
| 04-Ammanuel Diressa | Ryerson | 5 |
| 12-Aaron Best | Ryerson | 5 |
| 04-Devon Williams | Toronto | 5 |
| 11-Marko Kovac | Windsor | 5 |
| 11-Tommy Hobbs | York | 5 |
| 11-Johneil Simpson | Brock | 6 |
| 05-Henry Tan | Lakehead | 6 |
| 21-Anthony Iacoe | Laurentian | 6 |
| 06-Will Coulthard | Laurier | 6 |
| 11-Mike L’Africain | Ottawa | 6 |
| 03-Jon Ravenhorst | Waterloo | 6 |
| 13-Isiah Osborne | Windsor | 6 |
Silhouette Method for finding the optimal number of clusters for the 2016-17 season
The Silhouette method suggests 2 as the optimal number of clusters for this season. This is possibly separating the players into forwards/centers and guards. Below are tables to show the average statistics for each cluster and also which players belong to which cluster.
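For reference, the silhouette width of one observation compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The silhouette method picks the number of clusters with the highest average width. The sketch below is a minimal illustration on hypothetical one-dimensional data and assumes at least two clusters.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette width of a clustering with at least two clusters."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [
            math.dist(p, q)
            for j, (q, other) in enumerate(zip(points, labels))
            if other == lab and j != i
        ]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(own) / len(own)  # cohesion: mean distance within own cluster
        b = min(                 # separation: mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels)
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical observations with one natural and one unnatural two-cluster labeling.
data = [(0,), (1,), (10,), (11,)]
print(round(mean_silhouette(data, [0, 0, 1, 1]), 3))  # well-separated grouping
print(round(mean_silhouette(data, [0, 1, 0, 1]), 3))  # poor grouping
```

Values near 1 indicate tight, well-separated clusters, while values near 0 or below indicate overlapping or misassigned clusters.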
Cluster Plot of 2016-17 Season
| Statistic | Cluster 1 | Cluster 2 |
|---|---|---|
| 3PointMade | 16.67 | 25.09 |
| 3PointAttempted | 51.80 | 73.85 |
| Assists | 30.02 | 42.82 |
| Blocks | 4.92 | 9.18 |
| DefensiveRebounds | 53.12 | 83.88 |
| FieldGoalMade | 55.86 | 101.97 |
| FieldGoalAttempted | 136.37 | 234.88 |
| FreeThrowsMade | 24.86 | 50.88 |
| FreeThrowsAttempted | 37.14 | 71.27 |
| Minutes | 452.90 | 559.09 |
| OffensiveRebounds | 20.53 | 30.91 |
| PersonalFouls | 35.53 | 44.12 |
| Points | 153.24 | 279.91 |
| Rebounds | 73.65 | 114.79 |
| Steals | 16.88 | 23.67 |
| Turnovers | 28.86 | 47.39 |
| Home | 9.14 | 9.70 |
| GamesPlayed | 18.43 | 19.12 |
| PointsPerGame | 8.34 | 14.70 |
| MinutesPerGame | 24.59 | 29.24 |
| 3P% | 0.28 | 0.30 |
| FG% | 0.41 | 0.44 |
| FT% | 0.68 | 0.71 |
| TrueShooting% | 0.50 | 0.52 |
| Player | Team | Cluster |
|---|---|---|
| 06-Nathan Riley | Algoma | 1 |
| 13-Reng Gum | Algoma | 1 |
| 09-Daniel Cayer | Brock | 1 |
| 14-Ryan Bennett | Brock | 1 |
| 25-Tyler Brown | Brock | 1 |
| 03-Marcus Anderson | Carleton | 1 |
| 42-Eddie Ekiyor | Carleton | 1 |
| 04-Daniel Dooley | Guelph | 1 |
| 05-Jonathan Wallace | Guelph | 1 |
| 11-Taylor Boers | Guelph | 1 |
| 15-Drew Walford | Guelph | 1 |
| 44-Ahmed Haroon | Guelph | 1 |
| 03-Nick Burke | Lakehead | 1 |
| 05-Henry Tan | Lakehead | 1 |
| 21-Alexandre Robichaud | Lakehead | 1 |
| 44-OJ Watson | Laurentian | 1 |
| 02-Matthew Minutillo | Laurier | 1 |
| 04-Chuder Teny | Laurier | 1 |
| 08-Vlad Matovic | Laurier | 1 |
| 10-Owen Coulthard | Laurier | 1 |
| 12-Elliot Ormond | McMaster | 1 |
| 44-Lazar Kojovic | McMaster | 1 |
| 06-Jordon Campbell | Nipissing | 1 |
| 07-Jerron Rhodes | Nipissing | 1 |
| 10-Ismael Kaba | Nipissing | 1 |
| 21-Justin Shaver | Nipissing | 1 |
| 22-Jaaden Lewis | Nipissing | 1 |
| 09-Matt Plunkett | Ottawa | 1 |
| 10-Brandon Robinson | Ottawa | 1 |
| 15-Brody Maracle | Ottawa | 1 |
| 24-Adam Presutti | Ottawa | 1 |
| 05-Isse Ibrahim | Queen’s | 1 |
| 08-Jesse Graham | Queen’s | 1 |
| 13-Sammy Ayisi | Queen’s | 1 |
| 14-Keevon Small | Ryerson | 1 |
| 15-Myles Charvis | Ryerson | 1 |
| 22-Juwon Grannum | Ryerson | 1 |
| 04-Reilly Reid | Toronto | 1 |
| 05-Sage Usher | Toronto | 1 |
| 21-Daniel Johansson | Toronto | 1 |
| 07-Ben Davis | Waterloo | 1 |
| 05-Eric McDonald | Western | 1 |
| 07-Jedson Tavernier | Western | 1 |
| 11-Cam Morris | Western | 1 |
| 13-Ian Smart | Western | 1 |
| 20-Nikola Farkic | Western | 1 |
| 20-Lucas Orlita | Windsor | 1 |
| 22-Tyler Persaud | Windsor | 1 |
| 10-Nidun Chandrakumar | York | 1 |
| 10-Sean Clendinning | Algoma | 2 |
| 22-Jermaine Lyle | Algoma | 2 |
| 11-Johneil Simpson | Brock | 2 |
| 13-Dani Elgadi | Brock | 2 |
| 10-Connor Wood | Carleton | 2 |
| 41-Kaza Kajami-Keane | Carleton | 2 |
| 24-Bacarius Dinkins | Lakehead | 2 |
| 10-Kadre Gray | Laurentian | 2 |
| 11-David Aromolaran | Laurentian | 2 |
| 23-Nelson Yengue | Laurentian | 2 |
| 12-Matt Chesson | Laurier | 2 |
| 13-Tevaun Kokko | Laurier | 2 |
| 11-Connor Gilmore | McMaster | 2 |
| 21-David McCulloch | McMaster | 2 |
| 22-Rohan Boney | McMaster | 2 |
| 13-Marcus Lewis | Nipissing | 2 |
| 05-Jean Emmanuel Pierre-Charles | Ottawa | 2 |
| 06-Caleb Agada | Ottawa | 2 |
| 12-Tanner Graham | Queen’s | 2 |
| 04-Ammanuel Diressa | Ryerson | 2 |
| 21-Adika Peter-McNeilly | Ryerson | 2 |
| 06-Devin Johnson | Toronto | 2 |
| 03-Jon Ravenhorst | Waterloo | 2 |
| 04-Simon Petrov | Waterloo | 2 |
| 20-Mike Pereira | Waterloo | 2 |
| 23-Justin Hardy | Waterloo | 2 |
| 42-Nedim Hodzic | Waterloo | 2 |
| 08-Eriq Jenkins | Western | 2 |
| 12-Omar Shiddo | Western | 2 |
| 05-Micqueel Martin | Windsor | 2 |
| 10-Mike Rocca | Windsor | 2 |
| 11-Jayden Frederick | York | 2 |
| 20-Brandon Ramirez | York | 2 |
Finding the Optimal Cluster using D-index for the 2017-18 season.
The optimal number of clusters suggested by the D-index method is 5 for this season. This is possibly separating the players into the actual positions (Point Guard, Shooting Guard, Small Forward, Power Forward, and Center). Below are tables showing the average statistics for each cluster and which players belong to which cluster.
Cluster Plot of 2017-18 Season
| Statistic | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| 3PointMade | 46.75 | 16.23 | 23.72 | 28.11 | 29.50 |
| 3PointAttempted | 127.75 | 49.32 | 71.44 | 85.00 | 86.33 |
| Assists | 57.92 | 33.95 | 41.36 | 64.17 | 85.33 |
| Blocks | 3.83 | 6.68 | 10.04 | 11.06 | 11.33 |
| DefensiveRebounds | 70.17 | 60.45 | 74.40 | 93.22 | 137.00 |
| FieldGoalMade | 133.00 | 53.00 | 81.40 | 109.00 | 156.33 |
| FieldGoalAttempted | 309.83 | 129.64 | 189.52 | 255.67 | 340.33 |
| FreeThrowsMade | 58.42 | 20.86 | 36.56 | 55.17 | 117.17 |
| FreeThrowsAttempted | 75.50 | 33.68 | 52.52 | 77.28 | 144.00 |
| Minutes | 606.67 | 496.41 | 571.40 | 703.78 | 750.17 |
| OffensiveRebounds | 22.17 | 23.73 | 26.32 | 30.72 | 40.17 |
| PersonalFouls | 49.33 | 44.00 | 47.08 | 52.89 | 54.00 |
| Points | 371.17 | 143.09 | 223.08 | 301.28 | 459.33 |
| Rebounds | 92.33 | 84.18 | 100.72 | 123.94 | 177.17 |
| Steals | 26.50 | 17.73 | 19.88 | 31.11 | 27.67 |
| Turnovers | 48.58 | 31.09 | 34.00 | 49.33 | 65.33 |
| Home | 11.25 | 11.36 | 11.44 | 11.56 | 11.50 |
| GamesPlayed | 22.25 | 22.50 | 22.72 | 23.22 | 23.17 |
| PointsPerGame | 16.82 | 6.44 | 9.88 | 13.00 | 19.95 |
| MinutesPerGame | 27.36 | 22.11 | 25.22 | 30.34 | 32.43 |
| 3P% | 0.37 | 0.27 | 0.29 | 0.30 | 0.27 |
| FG% | 0.43 | 0.41 | 0.43 | 0.43 | 0.46 |
| FT% | 0.77 | 0.64 | 0.71 | 0.71 | 0.82 |
| TrueShooting% | 0.54 | 0.48 | 0.52 | 0.52 | 0.56 |
| Player | Team | Cluster |
|---|---|---|
| 10-Ian Nash | Algoma | 1 |
| 11-Johneil Simpson | Brock | 1 |
| 35-Cassidy Ryan | Brock | 1 |
| 10-Yasiin Joseph | Carleton | 1 |
| 08-Mor Menashe | Lakehead | 1 |
| 06-Ali Sow | Laurier | 1 |
| 11-Tevaun Kokko | Laurier | 1 |
| 11-Miles Seward | McMaster | 1 |
| 22-Jaz Bains | Queen’s | 1 |
| 04-Manny Diressa | Ryerson | 1 |
| 10-Marko Kovac | Western | 1 |
| 12-Omar Shiddo | Western | 1 |
| 07-Pedro Costa | Algoma | 2 |
| 09-Kascius Small-Martin | Brock | 2 |
| 03-Marcus Anderson | Carleton | 2 |
| 15-Drew Walford | Guelph | 2 |
| 03-Darnell Curtin | Lakehead | 2 |
| 24-Litha Ncanisa | Laurentian | 2 |
| 03-Ntore Habimana | Laurier | 2 |
| 25-Andre Toic | McMaster | 2 |
| 05-Marvin Ngonadi | Nipissing | 2 |
| 07-Jerron Rhodes | Nipissing | 2 |
| 10-Ismael Kaba | Nipissing | 2 |
| 12-Gage Sabean | Ottawa | 2 |
| 15-Brody Maracle | Ottawa | 2 |
| 04-Harry Range | Queen’s | 2 |
| 10-Filip Vujadinovic | Ryerson | 2 |
| 20-Nikola Farkic | Western | 2 |
| 15-Damian Persaud | Windsor | 2 |
| 21-Lucas Wood | Windsor | 2 |
| 10-Gene Spagnuolo | York | 2 |
| 11-Prince Kamunga | York | 2 |
| 13-Nana Adu-Poku | York | 2 |
| 15-Ricky Hudson | York | 2 |
| 09-Cailum White | Algoma | 3 |
| 13-Reng Gum | Algoma | 3 |
| 22-Jermaine Lyle | Algoma | 3 |
| 15-Daniel Cayer | Brock | 3 |
| 25-Tyler Brown | Brock | 3 |
| 13-Munis Tutu | Carleton | 3 |
| 42-Eddie Ekiyor | Carleton | 3 |
| 05-Jonathan Wallace | Guelph | 3 |
| 11-Taylor Boers | Guelph | 3 |
| 21-Anthony Iacoe | Laurentian | 3 |
| 02-Matt Minutillo | Laurier | 3 |
| 10-Matt Quiring | McMaster | 3 |
| 02-Sean Stoqua | Ottawa | 3 |
| 03-Calvin Epistola | Ottawa | 3 |
| 05-Jean Emmanuel Pierre-Charles | Ottawa | 3 |
| 06-Mike Shoveller | Queen’s | 3 |
| 07-Quinton Gray | Queen’s | 3 |
| 05-Roshane Roberts | Ryerson | 3 |
| 11-Christopher Barrett | Toronto | 3 |
| 21-Daniel Johansson | Toronto | 3 |
| 22-Nikola Paradina | Toronto | 3 |
| 15-David Ramon Prados | Waterloo | 3 |
| 09-Henry Tan | Western | 3 |
| 05-Anthony Zrvnar | Windsor | 3 |
| 08-Gianmarco Luciani | York | 3 |
| 06-Nathan Riley | Algoma | 4 |
| 04-Daniel Dooley | Guelph | 4 |
| 23-Nick Burke | Lakehead | 4 |
| 11-David Aromolaran | Laurentian | 4 |
| 23-Nelson Yengue | Laurentian | 4 |
| 04-Kareem Collins | McMaster | 4 |
| 13-Marcus Lewis | Nipissing | 4 |
| 22-Jaaden Lewis | Nipissing | 4 |
| 10-Brandon Robinson | Ottawa | 4 |
| 12-Tanner Graham | Queen’s | 4 |
| 07-Myles Charvis | Ryerson | 4 |
| 08-Jean-Victor Mukama | Ryerson | 4 |
| 04-Reilly Reid | Toronto | 4 |
| 05-Sage Usher | Toronto | 4 |
| 20-Justin Hardy | Waterloo | 4 |
| 08-Eriq Jenkins | Western | 4 |
| 11-Marcus Jones | Windsor | 4 |
| 20-Lucas Orlita | Windsor | 4 |
| 13-Dani Elgadi | Brock | 5 |
| 10-Kadre Gray | Laurentian | 5 |
| 21-David McCulloch | McMaster | 5 |
| 04-Simon Petrov | Waterloo | 5 |
| 42-Nedim Hodzic | Waterloo | 5 |
| 10-Mike Rocca | Windsor | 5 |
Finding the Optimal Cluster using D-index for the 2018-19 season.
The optimal number of clusters suggested for the 2018-19 season is 4. Below are tables to show the average statistics for each cluster and also which players belong to which cluster.
Cluster Plot of 2018-19 Season
| Statistic | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|
| 3PointMade | 25.24 | 75.50 | 18.74 | 38.00 |
| 3PointAttempted | 79.55 | 205.75 | 58.26 | 106.56 |
| Assists | 51.12 | 70.75 | 29.81 | 53.06 |
| Blocks | 9.03 | 4.25 | 6.85 | 11.12 |
| DefensiveRebounds | 83.12 | 96.00 | 60.33 | 114.81 |
| FieldGoalMade | 91.48 | 192.75 | 62.52 | 132.25 |
| FieldGoalAttempted | 215.67 | 435.75 | 150.70 | 308.19 |
| FreeThrowsMade | 40.79 | 105.00 | 28.93 | 81.19 |
| FreeThrowsAttempted | 58.48 | 130.75 | 40.74 | 108.38 |
| Minutes | 613.58 | 775.25 | 478.74 | 708.94 |
| OffensiveRebounds | 28.67 | 22.00 | 20.56 | 36.44 |
| PersonalFouls | 52.00 | 48.50 | 40.37 | 50.38 |
| Points | 249.00 | 566.00 | 172.70 | 383.69 |
| Rebounds | 111.79 | 118.00 | 80.89 | 151.25 |
| Steals | 24.64 | 31.00 | 15.96 | 24.69 |
| Turnovers | 37.42 | 68.00 | 29.00 | 50.12 |
| Home | 11.55 | 11.50 | 10.33 | 11.38 |
| GamesPlayed | 22.73 | 23.00 | 20.52 | 22.88 |
| PointsPerGame | 11.00 | 24.88 | 8.65 | 16.80 |
| MinutesPerGame | 27.05 | 33.72 | 23.51 | 30.99 |
| 3P% | 0.29 | 0.36 | 0.29 | 0.33 |
| FG% | 0.43 | 0.44 | 0.42 | 0.43 |
| FT% | 0.69 | 0.79 | 0.70 | 0.75 |
| TrueShooting% | 0.51 | 0.57 | 0.51 | 0.54 |
| Player | Team | Cluster |
|---|---|---|
| 03-Elijah Butler | Algoma | 1 |
| 08-David Bokanga | Algoma | 1 |
| 15-Daniel Cayer | Brock | 1 |
| 25-Tyler Brown | Brock | 1 |
| 11-Tj Lall | Carleton | 1 |
| 13-Munis Tutu | Carleton | 1 |
| 42-Eddie Ekiyor | Carleton | 1 |
| 22-Rasheed Weekes | Guelph | 1 |
| 08-Lock Lam | Lakehead | 1 |
| 23-Nick Burke | Lakehead | 1 |
| 21-Anthony Iacoe | Laurentian | 1 |
| 02-Matt Minutillo | Laurier | 1 |
| 03-Ntore Habimana | Laurier | 1 |
| 05-Jackson Mayers | Laurier | 1 |
| 11-Justin Hill | Nipissing | 1 |
| 03-Calvin Epistola | Ottawa | 1 |
| 07-Mackenzie Morrison | Ottawa | 1 |
| 10-Brandon Robinson | Ottawa | 1 |
| 07-Quinton Gray | Queen’s | 1 |
| 23-Jayden Frederick | Ryerson | 1 |
| 09-Evan Shadkami | Toronto | 1 |
| 11-Christopher Barrett | Toronto | 1 |
| 21-Daniel Johansson | Toronto | 1 |
| 22-Nikola Paradina | Toronto | 1 |
| 08-Eriq Jenkins | Western | 1 |
| 13-Julian Walker | Western | 1 |
| 20-Nikola Farkic | Western | 1 |
| 08-Chris Poloniato | Windsor | 1 |
| 11-Telloy Simon | Windsor | 1 |
| 14-Thomas Kennedy | Windsor | 1 |
| 20-Lucas Orlita | Windsor | 1 |
| 02-Chevon Brown | York | 1 |
| 05-DeAndrae Pierre | York | 1 |
| 11-Johneil Simpson | Brock | 2 |
| 10-Kadre Gray | Laurentian | 2 |
| 06-Ali Sow | Laurier | 2 |
| 12-Omar Shiddo | Western | 2 |
| 10-Michael Vos Otin | Brock | 3 |
| 03-Marcus Anderson | Carleton | 3 |
| 10-Yasiin Joseph | Carleton | 3 |
| 05-Aaron Nugent | Guelph | 3 |
| 21-Davarius Wright | Lakehead | 3 |
| 22-Josis Mikia-Thomas | Laurentian | 3 |
| 24-Litha Ncanisa | Laurentian | 3 |
| 32-Gaetan Chamand | Laurentian | 3 |
| 23-Sefa Otchere | McMaster | 3 |
| 32-Jordan Henry | McMaster | 3 |
| 05-Marvin Ngonadi | Nipissing | 3 |
| 08-Jordan Roberts | Nipissing | 3 |
| 12-Quintin Ashitei | Nipissing | 3 |
| 04-Harry Range | Queen’s | 3 |
| 05-Yusuf Ali | Ryerson | 3 |
| 10-Filip Vujadinovic | Ryerson | 3 |
| 14-Keevon Small | Ryerson | 3 |
| 04-Simon Petrov | Waterloo | 3 |
| 05-Colin Connors | Waterloo | 3 |
| 07-Jeff Baradziej | Waterloo | 3 |
| 15-David Ramon Prados | Waterloo | 3 |
| 23-Justin Malnerich | Waterloo | 3 |
| 09-Marko Kovac | Western | 3 |
| 10-Anthony Zrvnar | Windsor | 3 |
| 04-Prince Kamunga | York | 3 |
| 08-Gianmarco Luciani | York | 3 |
| 10-Gene Spagnuolo | York | 3 |
| 06-Nathan Riley | Algoma | 4 |
| 35-Cassidy Ryan | Brock | 4 |
| 15-Malcolm Glanville | Guelph | 4 |
| 32-Tommy Yanchus | Guelph | 4 |
| 40-Banky Alade | Guelph | 4 |
| 01-Isaiah Traylor | Lakehead | 4 |
| 11-Connor Gilmore | McMaster | 4 |
| 21-David McCulloch | McMaster | 4 |
| 13-Marcus Lewis | Nipissing | 4 |
| 12-Gage Sabean | Ottawa | 4 |
| 41-Guillaume Pepin | Ottawa | 4 |
| 03-Jaz Bains | Queen’s | 4 |
| 12-Tanner Graham | Queen’s | 4 |
| 07-Myles Charvis | Ryerson | 4 |
| 08-JV Mukama | Ryerson | 4 |
| 42-Nedim Hodzic | Waterloo | 4 |
Unsupervised learning can be very helpful when applied to basketball data. It can be used to categorize players by their style of play, to plan match-ups, and to identify important players. For instance, if an opposing player falls in the same cluster as the catch-and-shoot players, a coach can assign an appropriate defender. Knowing the opponents’ style of play is very useful for defensive purposes. In my opinion, a higher number of clusters is preferable because it distinguishes the types of player more finely.
In machine learning and statistics, classification is a supervised learning approach in which the model learns from the labelled data given to it and then uses what it has learned to classify new observations. In this case, classification can be used to distinguish wins from losses and to predict whether a game will be a win or a loss. That means we want to identify which variables are the most important in distinguishing a win (or a loss). There are many classification techniques, such as Random Forests, Support Vector Machines, Logistic Regression, and XGBoost.
Random forest is an ensemble learning method for classification. Ensemble methods are effective because they combine multiple learning algorithms to obtain better predictive performance than any of the constituent algorithms could achieve alone. A random forest consists of a large number of decision trees that operate as an ensemble. Each individual tree gives a predicted outcome, and the class with the most votes becomes the model’s prediction [11]. The technique works well because a large number of relatively uncorrelated models (trees) operating as a committee tends to outperform any of the individual constituent models. Random forests also give an importance score for every feature used in the model. A standard procedure is to first fit the model with all the variables and then use feature importance to narrow the model down and improve its accuracy.
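The voting step described above can be illustrated with a minimal sketch. It is not the model fitted in this report: the three "trees" are single-threshold stumps, and the thresholds and the example game are hypothetical values chosen only to show how the majority vote is taken.

```python
from collections import Counter


def make_stump(feature, threshold):
    """Return a one-split 'tree' that votes Win when feature > threshold.

    The thresholds used below are hypothetical, not fitted values.
    """
    def stump(game):
        return "Win" if game[feature] > threshold else "Loss"
    return stump


def majority_vote(tree_predictions):
    """Return the class that received the most votes from the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# A hypothetical three-tree "forest" and a hypothetical game.
trees = [
    make_stump("Assists", 15),
    make_stump("DefensiveRebounds", 25),
    make_stump("Steals", 8),
]
game = {"Assists": 18, "DefensiveRebounds": 22, "Steals": 10}

votes = [tree(game) for tree in trees]
forest_prediction = majority_vote(votes)
print(votes, "->", forest_prediction)
```

In a real random forest each tree is a full decision tree grown on a bootstrap sample with a random subset of features, which is what keeps the trees relatively uncorrelated.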
In this random forest model we are predicting wins using the following predictors: Assists, DefensiveRebounds, TotalRebounds, Turnovers, PushBallfromTurnover, Steals, PressOffense, UnguardedJumpShots, AllFreeThrows, P&RBallHandler-SingleCovered, Cuts, GuardedJumpShots, ShortJumpShots, TransitionOffense, LongJumpShots, Transitions, SpotUps, P&RBallHandler-DefenseCommits, PushBallfromShotAttempt, PushBalltoHalfCourtOff., OffensiveRebounds, MiscellaneousPossessions, Isolation-SingleCovered, Post-Up-SingleCovered, MediumJumpShots, Blocks, OffScreens, Off.Reb.-PutBacks, Handoffs, Off.Reb.-ResetOffense, TransitionTurnover, P&RRollMan, Isolation-DefenseCommits, Post-Up-DefenseCommits, Post-Up-HardDoubleTeam, P&RBallHandler-Traps. The model is trained on a training set, which is a random sample (without replacement) of 70% of the dataset, and tested on a random sample of 30% of the dataset. The accuracy score is obtained below.
Accuracy: 0.7454128440366973
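The split and the accuracy score can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the result above: the helper names `split_70_30` and `accuracy` are hypothetical, the test set is taken to be the 30% of rows left over after the training sample is drawn, and the rows are placeholder integers rather than games.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle the rows with a fixed seed and split them 70% / 30%.

    Assumption: the test set is the complement of the training sample.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Placeholder data: 100 row identifiers standing in for games.
games = list(range(100))
train, test = split_70_30(games, seed=42)
print(len(train), len(test))

# Hypothetical labels and predictions for four test games (1 = win, 0 = loss).
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Fixing the seed makes the split reproducible; a different seed gives a different split and therefore a slightly different accuracy.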
An important advantage of the random forest algorithm is that it provides feature importance scores, which tell us which variables contributed most to the model, i.e. which features are the most important in classifying and predicting wins. A table of the feature importance from this model is shown below.
| Feature | Feature Importance Value |
|---|---|
| Assists | 0.096881 |
| DefensiveRebounds | 0.062817 |
| TotalRebounds | 0.059851 |
| Turnovers | 0.056744 |
| PushBallfromTurnover | 0.044454 |
| Steals | 0.038865 |
| PressOffense | 0.037174 |
| UnguardedJumpShots | 0.030118 |
| AllFreeThrows | 0.029086 |
| P&RBallHandler-SingleCovered | 0.028485 |
| Cuts | 0.027246 |
| GuardedJumpShots | 0.026854 |
| ShortJumpShots | 0.025915 |
| TransitionOffense | 0.025691 |
| LongJumpShots | 0.023443 |
| Transitions | 0.023180 |
| SpotUps | 0.023102 |
| P&RBallHandler-DefenseCommits | 0.022575 |
| PushBallfromShotAttempt | 0.022505 |
| PushBalltoHalfCourtOff. | 0.022279 |
| OffensiveRebounds | 0.022109 |
| MiscellaneousPossessions | 0.020995 |
| Isolation-SingleCovered | 0.020878 |
| Post-Up-SingleCovered | 0.020679 |
| MediumJumpShots | 0.019900 |
| Blocks | 0.019759 |
| OffScreens | 0.018904 |
| Off.Reb.-PutBacks | 0.017528 |
| Handoffs | 0.017504 |
| Off.Reb.-ResetOffense | 0.017282 |
| TransitionTurnover | 0.017150 |
| P&RRollMan | 0.015011 |
| Isolation-DefenseCommits | 0.013565 |
| Post-Up-DefenseCommits | 0.013291 |
| Post-Up-HardDoubleTeam | 0.012348 |
| P&RBallHandler-Traps | 0.005832 |
Note: since random forests take samples randomly, the accuracy will vary depending on the seed chosen.
In this random forest model we are predicting wins using a refined selection of predictors: Assists, DefensiveRebounds, TotalRebounds, Turnovers, Steals, PushBallfromTurnover, PressOffense, AllFreeThrows, UnguardedJumpShots, ShortJumpShots, Cuts, LongJumpShots, Transitions, GuardedJumpShots, PushBallfromShotAttempt, AllP&RBallHandler, SpotUps, AllPost-Up, PushBalltoHalfCourtOff., AllOffensiveRebounds, MiscellaneousPossessions, AllIsolation, OffScreens, Blocks, MediumJumpShots, Isolation-SingleCovered, Handoffs. The model is trained on a training set, which is a random sample (without replacement) of 70% of the dataset, and tested on a random sample of 30% of the dataset. The accuracy score is obtained below.
Accuracy: 0.7477064220183486
A table of the feature importance from this model is shown below.
| Feature | Feature Importance Value |
|---|---|
| Assists | 0.103865 |
| DefensiveRebounds | 0.079669 |
| TotalRebounds | 0.064047 |
| Turnovers | 0.062428 |
| Steals | 0.059118 |
| PushBallfromTurnover | 0.048081 |
| PressOffense | 0.039639 |
| AllFreeThrows | 0.034666 |
| UnguardedJumpShots | 0.033493 |
| ShortJumpShots | 0.032340 |
| Cuts | 0.031976 |
| LongJumpShots | 0.030290 |
| Transitions | 0.029443 |
| GuardedJumpShots | 0.029287 |
| PushBallfromShotAttempt | 0.028180 |
| AllP&RBallHandler | 0.028030 |
| SpotUps | 0.027917 |
| AllPost-Up | 0.027047 |
| PushBalltoHalfCourtOff. | 0.025952 |
| AllOffensiveRebounds | 0.025533 |
| MiscellaneousPossessions | 0.024643 |
| AllIsolation | 0.024239 |
| OffScreens | 0.023363 |
| Blocks | 0.022834 |
| MediumJumpShots | 0.022212 |
| Isolation-SingleCovered | 0.021883 |
| Handoffs | 0.019824 |
Assists are the most important feature in both models for classifying whether a game is a win or loss.
Logistic Regression is a form of regression that is used when the response variable is a categorical variable [12]. In this case it is a binary value (e.g. Success or Failure). The game by game data can be used to create a model that predicts Wins.
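The same features are used in this model as in the second model in the Random Forests section. Logistic regression will be used to classify and then predict wins. Because the response is binary, the linear predictor models the log-odds of a win rather than the win itself. The equation is below.
\[ \log\left(\frac{P(Win)}{1 - P(Win)}\right) = \beta_0 + \beta_1 \cdot Assists + \dots + \beta_{27} \cdot Handoffs \]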
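To make the link between the linear predictor and a win/loss prediction concrete, here is a minimal sketch. Logistic regression passes the linear combination of the features through the logistic (sigmoid) function to obtain a probability, which is then thresholded. The intercept, the two coefficients and the game values below are hypothetical, not the fitted values from this model, and the 0.5 cut-off is an assumption.

```python
import math


def win_probability(intercept, coefficients, features):
    """Logistic function applied to intercept + sum(beta_k * x_k)."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


def classify(probability, threshold=0.5):
    """Label a game as a win when the probability reaches the threshold."""
    return "Win" if probability >= threshold else "Loss"


# Hypothetical coefficients and a hypothetical game (two features only).
intercept = -1.0
coefficients = {"Assists": 0.10, "Turnovers": -0.05}
game = {"Assists": 20, "Turnovers": 10}

p = win_probability(intercept, coefficients, game)
print(round(p, 4), classify(p))
```

A negative coefficient, as for Turnovers here, lowers the predicted probability of a win as the feature increases.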
Again, the model is trained on a training set, which is a random sample (without replacement) of 70% of the dataset, and tested on a random sample of 30% of the dataset. The accuracy score is obtained below.
Accuracy: 0.7821100917431193
Feature Importance from Logistic Regression Model.
The feature importance from Logistic Regression differs from that of Random Forests, although Defensive Rebounds, Assists, and Turnovers are still at the top of the list. In general, turnovers negatively impact teams, so they can be an important feature for identifying teams that are less likely to win because they commit more turnovers.
A dataset has been modified to subtract the home team’s statistics from the away team’s statistics for each game so that there are differential statistics. The differential statistics were compared to see which contributed to the highest proportion of wins.
| Differential Statistics | Proportion of Wins |
|---|---|
| Positive Assists Differential | 845/1171 = 72.2% |
| Positive Rebounds Differential | 817/1171 = 69.8% |
| Negative Turnovers Differential | 726/1171 = 62% |
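The differential comparison can be sketched as follows. The four games are made-up placeholders, the field names are hypothetical, and skipping games with equal assists is an assumption; only the counts in the last lines (845, 817 and 726 out of 1171) come from the table above.

```python
def assists_edge_record(games):
    """Count how often the team with more assists won.

    The differential is away minus home, as described in the text.
    Games with equal assists are skipped (an assumption of this sketch).
    """
    wins = total = 0
    for game in games:
        diff = game["away_assists"] - game["home_assists"]
        if diff == 0:
            continue
        leader = "away" if diff > 0 else "home"
        total += 1
        wins += game["winner"] == leader
    return wins, total


# Hypothetical games, for illustration only.
sample_games = [
    {"away_assists": 18, "home_assists": 12, "winner": "away"},
    {"away_assists": 10, "home_assists": 15, "winner": "home"},
    {"away_assists": 14, "home_assists": 9, "winner": "home"},
    {"away_assists": 11, "home_assists": 16, "winner": "away"},
]
sample_record = assists_edge_record(sample_games)
print(sample_record)

# Proportions reported in the table, recomputed from the counts.
reported = {"Assists": 845, "Rebounds": 817, "Turnovers": 726}
proportions = {name: wins / 1171 for name, wins in reported.items()}
for name, share in proportions.items():
    print(f"{name}: {share:.1%}")
```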
2 by 2 table analysis:
------------------------------------------------------
Outcome : Win
Comparing : Positive Assists Differential vs. Negative Assists Differential
Win Lose P(Win) 95% conf. interval
Positive Assists Differential 845 326 0.7216 0.6952 0.7465
Negative Assists Differential 326 845 0.2784 0.2535 0.3048
95% conf. interval
Relative Risk: 2.5920 2.3481 2.8613
Sample Odds Ratio: 6.7186 5.6078 8.0494
Conditional MLE Odds Ratio: 6.7123 5.5848 8.0849
Probability difference: 0.4432 0.4059 0.4784
Exact P-value: 0.0000
Asymptotic P-value: 0.0000
------------------------------------------------------
Above is a two-by-two table analysis. The Sample Odds Ratio tells us that the odds of a team winning are about 6.7 times higher when it has more assists than its opponent than when it has fewer assists than its opponent. The Relative Risk tells us that teams with more assists than their opponent have 2.59 times the ‘risk’ (here, the probability) of winning compared to teams with fewer assists than their opponent.
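The point estimates in the output can be reproduced directly from the four cell counts. The sketch below uses the standard definitions of relative risk, sample odds ratio and probability difference for a two-by-two table; the function name `two_by_two` is hypothetical, and the confidence intervals and the conditional MLE are not recomputed here.

```python
def two_by_two(a, b, c, d):
    """Summary measures for a 2x2 table.

    a, b: wins and losses for teams with a positive assists differential.
    c, d: wins and losses for teams with a negative assists differential.
    """
    p1 = a / (a + b)
    p2 = c / (c + d)
    return {
        "p_win_positive": p1,
        "p_win_negative": p2,
        "relative_risk": p1 / p2,
        "odds_ratio": (a * d) / (b * c),
        "probability_difference": p1 - p2,
    }


# Cell counts taken from the table above.
measures = two_by_two(845, 326, 326, 845)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Because the two rows mirror each other (each game contributes one team to each row), the odds ratio here is simply the square of the relative risk.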
Assists can lead to effective scoring: a player is set up for a shot, and each team distributes its shots differently. A study of the NBA (Pelechrinis, 2019) [13] showed that, on average, an assisted shot added 0.16 more expected points than an unassisted shot. If a team looked for the extra pass on 15 of its unassisted shots, this would correspond to approximately 15 × 0.16 = 2.4 additional expected points over the course of a game. An assist can also increase the average field goal percentage of a given type of shot relative to an unassisted shot (Pelechrinis, 2019). Assists are also necessary for effective playmaking. As seen previously, transitions, spot-ups and cuts are all very effective offensive plays, and the thing that connects them is an assist.
Using Synergy’s Multi-Game Shot Chart it is possible to see the difference in shooting efficiency between shots derived from an assist and shots that were not.
Side-by-Side shot chart of Carleton’s 2018-19 season. The left side is the shot chart of the entire season without any filters. The right side shows the chart of the entire season where shots were derived from passing plays.
[1] “Sports Analytics.” Wikipedia, Wikimedia Foundation, 29 June 2019, en.wikipedia.org/wiki/Sports_analytics.
[2] “Synergy Sports Technology.” Synergy Sports Technology, corp.synergysportstech.com/.
[3] “Ontario University Athletics (OUA).” OUA, www.oua.ca/landing/index.
[4] Bilder, Christopher R., and Thomas M. Loughin. Analysis of Categorical Data with R. CRC Press, 2015.
[5] Belhumeur, Kevin. “How Important Is Home-Court Advantage in the NBA?” Bleacher Report, Bleacher Report, 3 Oct. 2017, bleacherreport.com/articles/1520496-how-important-is-home-court-advantage-in-the-nba.
[6] U Sports Hoops - University Basketball in Canada, usportshoops.ca/history/team-history.php?Gender=MBB&Team=Carleton.
[7] Brid, Rajesh S. “Decision Trees - A Simple Way to Visualize a Decision.” Medium, GreyAtom, 26 Oct. 2018, medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb.
[8] Haefner, Jeff. “9 Stats That Every Serious Basketball Coach Should Track.” Breakthrough Basketball, 2013, www.breakthroughbasketball.com/stats/9_stats_basketball_coach_should_track.html.
[9] “NbClust.” Function | R Documentation, www.rdocumentation.org/packages/NbClust/versions/1.0/topics/NbClust.
[10] “Finding Optimal Number of Clusters.” R, 9 Feb. 2017, www.r-bloggers.com/finding-optimal-number-of-clusters/.
[11] “Understanding Random Forest.” Medium, Towards Data Science, 4 Aug. 2019, towardsdatascience.com/understanding-random-forest-58381e0602d2.
[12] “Logistic Regression.” Wikipedia, Wikimedia Foundation, 30 July 2019, en.wikipedia.org/wiki/Logistic_regression.
[13] Pelechrinis, Konstantinos. “Data Reveals the Value of an Assist in Basketball.” The Conversation, 8 July 2019, theconversation.com/data-reveals-the-value-of-an-assist-in-basketball-113893.
import re
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import csv
from collections import defaultdict


def get_links():
    """Collect the team-stats URL of every box score in each team's game log."""
    print("getting links...")
    teams = ['algoma', 'brock', 'carleton', 'guelph', 'lakehead', 'laurentian',
             'laurier', 'mcmaster', 'nipissing', 'ottawa', 'queens', 'ryerson',
             'toronto', 'waterloo', 'western', 'windsor', 'york']
    years = ['2014-15', '2015-16', '2016-17', '2017-18', '2018-19']
    original_url = 'http://oua.ca/sports/mbkb/'
    end_url = '?view=gamelog'
    href_list = []
    for year in years:
        for team in teams:
            current_url = original_url + year + '/teams/' + team + end_url
            r = requests.get(current_url)
            raw_html = r.content
            soup = BeautifulSoup(raw_html, 'html.parser')
            tables = soup.findAll('table')
            # Pick the largest table whose first link points to a box score.
            max_len = 0
            index = 0
            for i in range(len(tables)):
                tags = tables[i].findAll('a')
                if len(tags) > 0:
                    url = tags[0].get('href', None)
                    if "/boxscores/20" in url and len(tables[i]) > max_len:
                        index = i
                        max_len = len(tables[i])
            table = tables[index]
            tags = table.findAll('a')
            for tag in tags:
                # Turn the relative '..' prefix into an absolute URL.
                url = re.sub(r"\.\.", original_url + year, tag.get('href', None))
                url += '?view=teamstats'
                href_list.append(url)
    print("done getting links")
    return href_list
def scrape(url):
    """Create one dictionary of team stats per box-score URL on the OUA website.

    Takes a list of URLs. For some games there are extra fields to look out
    for; this function is for those that do not have those extra fields in
    the table.
    """
    quarter_labels = ['1st Qtr', '2nd Qtr', '3rd Qtr', '4th Qtr', 'Total']
    dictlist = {}
    for link in url:
        print(link)
        r = requests.get(link)
        soup = BeautifulSoup(r.content, 'html.parser')
        stats = soup.findAll('table')
        scores = soup.findAll('div', {'class': 'teams clearfix'})[0].table
        # Some links have different numbers of tables, and sometimes the team
        # stats table is in a different position, so look for its caption.
        table = stats[8]
        for j in range(2, len(stats)):
            if str(stats[j].caption) == ('<caption class="caption offscreen">'
                                         '<h2>Team Statistics</h2></caption>'):
                table = stats[j]
                break
        col_headers = table.findAll('th', {'scope': 'col'})
        row_headers = table.findAll('th', {'scope': 'row'})
        cells = table.findAll('td')
        game = {
            'Away': col_headers[1].text.strip(),
            'Home': col_headers[2].text.strip(),
        }
        # Winner and loser rows of the line score; a missing row or a missing
        # quarter is skipped instead of raising.
        for label, css_class in (('Winner', 'winner'), ('Loser', 'loser')):
            rows = scores.findAll('tr', {'class': css_class})
            if not rows:
                continue
            game[label] = rows[0].th.text.strip()
            points = rows[0].findAll('td')
            for k, quarter in enumerate(quarter_labels):
                if k < len(points):
                    game[label + ' ' + quarter + ' Pts'] = points[k].text.strip()
        # The first 16 rows of the team stats table hold an away and a home value.
        for j in range(16):
            if j >= len(row_headers):
                break
            name = row_headers[j].text.strip()
            if 2 * j < len(cells):
                game[name + ' Away'] = cells[2 * j].text.strip()
            if 2 * j + 1 < len(cells):
                game[name + ' Home'] = cells[2 * j + 1].text.strip()
        # The 17th row has a single value, stored under the ' Away' suffix.
        if len(row_headers) > 16 and len(cells) > 32:
            game[row_headers[16].text.strip() + ' Away'] = cells[32].text.strip()
        dictlist[link] = game
    return dictlist
if __name__ == '__main__':
    q = get_links()
    a = scrape(q)
    # One row per game URL, one column per scraped field.
    df = pd.DataFrame(a)
    df = df.T
    df = df.replace(r'\-', ' -- ', regex=True).astype(object)
    # Note: the original listing repeated 'Largest Lead Away' and
    # 'Time of Largest Lead Away' after 'Bench Points Home'; the Home
    # columns are assumed to be what was intended.
    df = df[['Away', 'FG Away', 'FG% Away', '3PT FG Away',
             '3PT FG% Away', 'FT Away', 'FT% Away', 'Rebounds Away',
             'Assists Away',
             'Turnovers Away', 'Points Off Turnovers Away',
             '2nd Chance Points Away', 'Points in the Paint Away',
             'Fastbreak Points Away', 'Bench Points Away',
             'Largest Lead Away', 'Time of Largest Lead Away',
             'Home', 'FG Home',
             'FG% Home', '3PT FG Home', '3PT FG% Home',
             'FT Home', 'FT% Home', 'Rebounds Home', 'Assists Home',
             'Turnovers Home', 'Points Off Turnovers Home',
             '2nd Chance Points Home', 'Points in the Paint Home',
             'Fastbreak Points Home',
             'Bench Points Home', 'Largest Lead Home',
             'Time of Largest Lead Home', 'Trends Away', 'Winner',
             'Winner 1st Qtr Pts',
             'Winner 2nd Qtr Pts', 'Winner 3rd Qtr Pts',
             'Winner 4th Qtr Pts', 'Winner Total Pts', 'Loser',
             'Loser 1st Qtr Pts', 'Loser 2nd Qtr Pts',
             'Loser 3rd Qtr Pts', 'Loser 4th Qtr Pts', 'Loser Total Pts']]
    df.to_csv('gbyg.csv', header=True)
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import re
import itertools


def get_links():
    """Collect the team-stats URL of every box score in each team's game log."""
    print("getting links...")
    teams = ['algoma', 'brock', 'carleton', 'guelph',
             'lakehead', 'laurentian',
             'laurier', 'mcmaster', 'nipissing',
             'ottawa', 'queens', 'ryerson',
             'toronto', 'waterloo', 'western', 'windsor', 'york']
    years = ['2014-15', '2015-16', '2016-17', '2017-18', '2018-19']
    original_url = 'http://oua.ca/sports/mbkb/'
    end_url = '?view=gamelog'
    href_list = []
    for year in years:
        for team in teams:
            current_url = (original_url + year +
                           '/teams/' + team + end_url)
            r = requests.get(current_url)
            raw_html = r.content
            soup = BeautifulSoup(raw_html, 'html.parser')
            tables = soup.findAll('table')
            # Pick the largest table whose first link points to a box score.
            max_len = 0
            index = 0
            for i in range(len(tables)):
                tags = tables[i].findAll('a')
                if len(tags) > 0:
                    url = tags[0].get('href', None)
                    if ("/boxscores/20" in url
                            and len(tables[i]) > max_len):
                        index = i
                        max_len = len(tables[i])
            table = tables[index]
            tags = table.findAll('a')
            for tag in tags:
                # Turn the relative '..' prefix into an absolute URL.
                url = re.sub(r"\.\.", original_url + year,
                             tag.get('href', None))
                url += '?view=teamstats'
                href_list.append(url)
    print("done getting links")
    return href_list
def vsplayers_scrape(url):
    """Scrape the box-score line of each visiting-team starter, per game."""
    vslist = {}
    for j in range(len(url)):
        print(url[j])
        r = requests.get(url[j])
        soup = BeautifulSoup(r.content, 'html.parser')
        boxscore = soup.find_all('article',
                                 {'class': 'game-boxscore bkb clearfix'})
        players = boxscore[0].find_all('div', {'class': 'player-stats'})
        team1 = players[0].find_all('div', {'class': 'stats-wrap clearfix'})
        visitorteam = team1[0].find_all(
            'div', {'class': 'stats-box full lineup visitor clearfix'})
        vslist[url[j]] = {}
        # Reset per game so a missing STARTERS section cannot reuse old rows.
        starters, starterstr = None, []
        for tbody in visitorteam[0].find_all('tbody'):
            if tbody.tr.text.strip() == 'STARTERS':
                starters = tbody
                starterstr = starters.find_all('tr')
        if len(starterstr) > 4:
            headers = visitorteam[0].find_all('th')
            names = starters.find_all('th')
            # names[0] and starterstr[0] are the 'STARTERS' label row.
            for i in range(len(names) - 1):
                cells = starterstr[i + 1].find_all('td')
                row = {
                    'Away': visitorteam[0].caption.text.strip(),
                    visitorteam[0].thead.th.text.strip():
                        names[i + 1].text.strip(),
                }
                # Column headers 1..13 pair with the 13 stat cells of the row.
                for c in range(13):
                    row[headers[c + 1].text.strip()] = cells[c].text.strip()
                vslist[url[j], i] = row
    return vslist
def vrplayers_scrape(url):
    """Scrape the box-score line of each visiting-team reserve, per game."""
    vrlist = {}
    for j in range(len(url)):
        print(url[j])
        r = requests.get(url[j])
        soup = BeautifulSoup(r.content, 'html.parser')
        boxscore = soup.find_all('article',
                                 {'class': 'game-boxscore bkb clearfix'})
        players = boxscore[0].find_all('div', {'class': 'player-stats'})
        team1 = players[0].find_all('div', {'class': 'stats-wrap clearfix'})
        visitorteam = team1[0].find_all(
            'div', {'class': 'stats-box full lineup visitor clearfix'})
        vrlist[url[j]] = {}
        # Reset per game so a missing RESERVES section cannot reuse old rows.
        reserves, reservestr = None, []
        for tbody in visitorteam[0].find_all('tbody'):
            if tbody.tr.text.strip() == 'RESERVES':
                reserves = tbody
                reservestr = reserves.find_all('tr')
        if len(reservestr) > 0:
            headers = visitorteam[0].find_all('th')
            names = reserves.find_all('th')
            # names[0] and reservestr[0] are the 'RESERVES' label row.
            for i in range(len(names) - 1):
                cells = reservestr[i + 1].find_all('td')
                row = {
                    'Away': visitorteam[0].caption.text.strip(),
                    visitorteam[0].thead.th.text.strip():
                        names[i + 1].text.strip(),
                }
                # Column headers 1..13 pair with the 13 stat cells of the row.
                for c in range(13):
                    row[headers[c + 1].text.strip()] = cells[c].text.strip()
                vrlist[url[j], i] = row
    return vrlist
def hsplayers_scrape(url):
    """Scrape the box-score line of each home-team starter, per game."""
    slist = {}
    for j in range(len(url)):
        print(url[j])
        r = requests.get(url[j])
        soup = BeautifulSoup(r.content, 'html.parser')
        boxscore = soup.find_all('article',
                                 {'class': 'game-boxscore bkb clearfix'})
        players = boxscore[0].find_all('div', {'class': 'player-stats'})
        team2 = players[0].find_all('div',
                                    {'class': 'stats-wrap clearfix'})[1]
        hometeam = team2.find_all(
            'div', {'class': 'stats-box full lineup home clearfix'})
        # Reset per game so a missing STARTERS section cannot reuse old rows.
        starters, starterstr = None, []
        for tbody in hometeam[0].find_all('tbody'):
            if tbody.tr.text.strip() == 'STARTERS':
                starters = tbody
                starterstr = starters.find_all('tr')
        slist[url[j]] = {}
        if len(starterstr) > 4:
            headers = hometeam[0].find_all('th')
            names = starters.find_all('th')
            # names[0] and starterstr[0] are the 'STARTERS' label row.
            for i in range(len(names) - 1):
                cells = starterstr[i + 1].find_all('td')
                row = {
                    'Home': hometeam[0].caption.text.strip(),
                    hometeam[0].thead.th.text.strip():
                        names[i + 1].text.strip(),
                }
                # Column headers 1..13 pair with the 13 stat cells of the row.
                for c in range(13):
                    row[headers[c + 1].text.strip()] = cells[c].text.strip()
                slist[url[j], i] = row
    return slist
def hrplayers_scrape(url):
    # Scrape the home team's reserves from every box-score page in `url`.
    rlist = {}
    for j in range(len(url)):
        print(url[j])
        r = requests.get(url[j])
        raw_html = r.content
        soup = BeautifulSoup(raw_html, 'html.parser')
        boxscore = soup.find_all('article',
                                 {'class': 'game-boxscore bkb clearfix'})
        players = boxscore[0].find_all('div',
                                       {'class': 'player-stats'})
        team2 = players[0].find_all('div',
                                    {'class': 'stats-wrap clearfix'})[1]
        hometeam = team2.find_all('div',
                                  {'class': 'stats-box full lineup home clearfix'})
        hometbody = hometeam[0].find_all('tbody')
        # Reset per game so a page without a RESERVES block cannot reuse
        # the rows of the previous game.
        reserves, reservestr = None, []
        for k in range(len(hometbody)):
            if hometbody[k].tr.text.strip() == 'RESERVES':
                reserves = hometbody[k]
                reservestr = reserves.find_all('tr')
        # Placeholder entry per game, as in the original script.
        rlist[url[j]] = {}
        if len(reservestr) > 0:
            head = hometeam[0].find_all('th')
            for i in range(len(reserves.find_all('th')) - 1):
                td = reservestr[i + 1].find_all('td')
                row = {
                    'Home': hometeam[0].caption.text.strip(),
                    hometeam[0].thead.th.text.strip():
                        reserves.find_all('th')[i + 1].text.strip(),
                }
                # Header columns 1-13 map onto the row's 13 <td> cells.
                for n in range(1, 14):
                    row[head[n].text.strip()] = td[n - 1].text.strip()
                rlist[url[j], i] = row
    return rlist
q = get_links()
b = hrplayers_scrape(q)
c = hsplayers_scrape(q)
d = vsplayers_scrape(q)
e = vrplayers_scrape(q)
# Home reserves: one row per player, hyphens spaced out, newlines removed.
df1 = pd.DataFrame(b)
df1 = df1.T
df1 = df1.replace(r'\-', ' -- ', regex=True).astype(object)
df1 = df1.replace(r'\n', '', regex=True).astype(object)
df1.to_csv('home reserves.csv', header=True)
# Home starters.
df2 = pd.DataFrame(c)
df2 = df2.T
df2 = df2.replace(r'\-', ' -- ', regex=True).astype(object)
df2 = df2.replace(r'\n', '', regex=True).astype(object)
df2.to_csv('home starters.csv', header=True)
# Visiting starters.
df3 = pd.DataFrame(d)
df3 = df3.T
df3 = df3.replace(r'\-', ' -- ', regex=True).astype(object)
df3 = df3.replace(r'\n', '', regex=True).astype(object)
df3.to_csv('visitors starters.csv', header=True)
# Visiting reserves.
df4 = pd.DataFrame(e)
df4 = df4.T
df4 = df4.replace(r'\-', ' -- ', regex=True).astype(object)
df4 = df4.replace(r'\n', '', regex=True).astype(object)
df4.to_csv('visitors reserves.csv', header=True)
# Combine all four tables into a single player-level file.
concat = pd.concat([df2, df3, df1, df4], sort=False)
concat.to_csv('player_data.csv', header=True)
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
import os
import sys
browser = webdriver.Chrome(os.path.join(sys.path[0], 'chromedriver'))
def login():
    login_url = 'https://www.synergysportstech.com/Synergy/Default.aspx'
    browser.get(login_url)
    username1 = browser.find_element_by_css_selector('#txtUserName')
    username = "************"
    username1.send_keys(username)
    password1 = browser.find_element_by_css_selector('#txtPassword')
    password = "************"
    password1.send_keys(password)
    browser.find_element_by_css_selector('#btnLogin').click()
def select_option(selector, text):
    # Click the <option> whose visible text equals `text` inside the
    # <select> element matched by the CSS `selector`.
    element = browser.find_element_by_css_selector(selector)
    for option in element.find_elements_by_tag_name('option'):
        if option.text == text:
            option.click()  # select() in earlier versions of webdriver
            break


def get_links():
    browser.get('https://www.synergysportstech.com/Synergy/Sport/Basketball/web/teamsst/Video/SelectGame2.aspx')
    select_option('#ctl00_MainContent_lstSeason', '2014 - 2015')
    time.sleep(3)
    select_option('#ctl00_MainContent_lstDivisionGroup', 'U Sports')
    select_option('#ctl00_MainContent_lstViewMax', '1600')
    select_option('#ctl00_MainContent_lstSubType', 'Regular Season')
    time.sleep(5)
    select_option('#ctl00_MainContent_lstDivisions',
                  'Ontario University Athletics')
    time.sleep(5)
    # The third table on the page lists the games; collect their links.
    links = browser.find_elements_by_tag_name('table')
    html = links[2].get_attribute('innerHTML')
    soup1 = BeautifulSoup(html, 'html.parser')
    href_list1 = soup1.find_all('a', href=True)
    url_list = []
    root_url = 'https://www.synergysportstech.com/Synergy/Sport/Basketball/web/teamsst/Video/'
    for link in href_list1:
        if "GameGrid2" in link['href']:
            url_list.append(root_url + link['href'])
    return url_list
def scrape(url_list):
    # `results` maps game URL -> team name -> {statistic: value}.
    results = {}
    for url in url_list:
        results[url] = {}
        browser.get(url)
        browser.find_element_by_link_text('Game Breakdown').click()
        time.sleep(5)
        table = browser.find_elements_by_class_name('Tier')
        raw_html = table[2].get_attribute('innerHTML')
        soup = BeautifulSoup(raw_html, 'html.parser')
        raw_html2 = table[0].get_attribute('innerHTML')
        soup2 = BeautifulSoup(raw_html2, 'html.parser')
        tr = soup2.find_all('tr')
        Away_Team = tr[1].td.text.strip()
        Away_Total_Score = tr[1].find_all('td')[1].text.strip()
        Home_Team = tr[2].td.text.strip()
        Home_Total_Score = tr[2].find_all('td')[1].text.strip()
        tierrow = soup.find_all('tr', {'class': 'TierRow'})
        results[url][Home_Team] = {}
        results[url][Away_Team] = {}
        for row in tierrow:
            cells = row.find_all('td')
            rowname = cells[0].text.strip()
            # 'Total Points' is re-assigned on every row; it stays inside
            # the loop to keep the original column order.
            results[url][Home_Team][rowname] = cells[1].text.strip()
            results[url][Home_Team]['Total Points'] = Home_Total_Score
            results[url][Away_Team][rowname] = cells[2].text.strip()
            results[url][Away_Team]['Total Points'] = Away_Total_Score
        # Compare scores as integers; as strings, '9' > '10' would be True.
        if int(Home_Total_Score) > int(Away_Total_Score):
            results[url][Home_Team]['Winner'] = 1
        elif int(Away_Total_Score) > int(Home_Total_Score):
            results[url][Away_Team]['Winner'] = 1
    return results
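if __name__ == '__main__':
    login()
    urls = get_links()
    results = scrape(urls)
    # One row per (game URL, team) pair.
    df = pd.DataFrame.from_dict({(i, j): results[i][j]
                                 for i in results.keys()
                                 for j in results[i].keys()})
    df = df.T
    # fillna returns a new frame, so assign it back before writing the file;
    # the losing team's missing 'Winner' value becomes 0.
    df = df.fillna(0)
    df.to_csv('General2014-15 .csv')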